Abstract

Ensuring the usefulness of electronic data sources while providing necessary privacy guarantees is an
important unsolved problem. This problem drives the need for an analytical framework that can quantify
the privacy of personally identifiable information while still providing a quantifiable benefit (utility) to
multiple legitimate information consumers. This paper presents an information-theoretic framework that
yields an analytical model with tight bounds on how much utility is possible for a given level
of privacy and vice versa. Specific contributions include: i) stochastic data models for both categorical
and numerical data; ii) utility-privacy tradeoff regions and the encoding (sanitization) schemes achieving
them for both classes and their practical relevance; and iii) modeling of prior knowledge at the user
and/or data source and optimal encoding schemes for both cases.

I. Introduction

As information technology and electronic communications have been rapidly applied to almost every
sphere of human activity, including commerce, medicine and social networking, the risk of accidental or
intentional disclosure of sensitive private information has increased. The concomitant creation of large
centralized searchable data repositories and deployment of applications that use them has made "leakage"
of private information such as medical data, credit card information, power consumption data, etc. highly
probable and thus an important and urgent societal problem. Unlike the secrecy problem, in the privacy
problem, disclosing data provides informational utility while enabling possible loss of privacy at the
same time. Thus, as shown in Fig. 1, in the course of a legitimate transaction, a user learns some public
information (e.g., gender and weight), which is allowed and needs to be supported for the transaction to
be meaningful, and at the same time the user can also learn/infer private information (e.g., cancer and income),
which needs to be prevented (or minimized). Thus, every user is (potentially) also an adversary.

The problem of privacy and information leakage has been studied for several decades by multiple 
research communities; information-theoretic approaches to the problem are few and far between and
have primarily focused on using information-theoretic metrics. However, a rigorous information-theoretic 
treatment of the utility-privacy (U-P) tradeoff problem remains open and the following questions are yet 
to be addressed: (i) the statistical assumptions on the data that allow information-theoretic analysis, (ii) 
the capability of revealing different levels of private information to different users, and (iii) modeling of 
and accounting for prior knowledge. In this work, we seek to apply information theoretic tools to address 
the open question of an analytical characterization that provides a tight U-P tradeoff. If one views public 
and private attributes of data in a repository as random variables with a joint probability distribution, a 
private attribute in a database remains private to the extent that revealing public attributes releases no 
additional information about it - in other words, minimizing the risk of privacy loss implies that the 
conditional entropy of the private attribute should be as high as possible after the disclosure. Thus, in 
Fig. 1, keeping the cancer attribute private would mean that, given knowledge of the public attributes of
gender and weight, the predictability of the cancer attribute should remain unchanged. To achieve this, 
the gender attribute in Entry 1 has been "sanitized." 
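
The conditional-entropy view described above can be illustrated numerically. The following minimal Python sketch assumes a hypothetical joint distribution of one public attribute (gender) and one private attribute (cancer); the probabilities are made up for illustration and are not taken from Fig. 1. It compares the uncertainty about the private attribute before any disclosure with the uncertainty that remains once the public attribute is revealed without sanitization.

```python
import math

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def conditional_entropy(joint):
    """H(private | public) in bits for a joint pmf {(public, private): prob}."""
    marginal = {}
    for (pub, _), p in joint.items():
        marginal[pub] = marginal.get(pub, 0.0) + p
    return -sum(p * math.log2(p / marginal[pub])
                for (pub, _), p in joint.items() if p > 0)

# Hypothetical joint pmf of (gender, cancer); illustrative numbers only.
joint = {("M", "yes"): 0.05, ("M", "no"): 0.45,
         ("F", "yes"): 0.20, ("F", "no"): 0.30}

# Marginal pmf of the private attribute.
prior = {}
for (_, priv), p in joint.items():
    prior[priv] = prior.get(priv, 0.0) + p

h_prior = entropy(prior)             # uncertainty before any disclosure
h_post = conditional_entropy(joint)  # uncertainty after revealing gender as is
```

In this toy example revealing the public attribute lowers the uncertainty about the private one (`h_post` is smaller than `h_prior`), which is the loss of privacy that sanitization is meant to limit.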

The utility of a data source lies in its ability to disclose data and privacy considerations have the
potential to hurt utility. Indeed, utility and privacy are competing goals in this context. For example, in
Fig. 1 one could sanitize all or most of the entries in the gender attribute to 'M' to obtain more privacy
but that could reduce the usefulness of the published data significantly. Any approach that considers only
the privacy aspect of information disclosure while ignoring the resultant reduction in utility is not likely to
be practically viable. To make a reasoned tradeoff, we need to know the maximum utility achievable for
a given level of privacy and vice versa, i.e., an analytical characterization of the set of all achievable U-P
tradeoff points. We show that this can be done using an elegant tool from information theory called rate
distortion theory: utility can be quantified via fidelity which, in turn, is related (inversely) to distortion.
Rate distortion has to be augmented with privacy constraints quantified via equivocation, which is related
to entropy.

Fig. 1. An example database with public and private attributes and its sanitized version.

Our Contributions: The central contribution of this work is a precise quantification of the tradeoff 
between the privacy needs of the individuals represented by the data and the utility of the sanitized 
(published) data for any data source using the theory of rate distortion with additional privacy constraints. 
Utility is quantified (inversely) via distortion (accuracy), and privacy via equivocation (entropy). 

We expose for the first time an essential dimension of information disclosure via an additional constraint 
on the disclosure rate, a measure of the precision of the sanitized data. Any controlled disclosure of public 
data needs to specify the accuracy and precision of the disclosure; while the two can be conflated using 
additive noise for numerical data, additive noise is not an option for categorical data (social security 
numbers, postal codes, disease status, etc.) and thus output precision becomes important to specify. For 
example, in Fig. 1, the weight attribute is a numeric field that could either be distorted with random
additive noise or truncated (or quantized) into ranges such as 90-100, 100-110, etc. The use of the digits 
of the social security number to identify and protect the privacy of students in grade sheets is a familiar 
non-numeric example. Sanitization (of the full SSN) is achieved by heuristically reducing precision to 
typically the last four digits. A theoretical framework that formally specifies the output precision necessary 
and sufficient to achieve the optimal U-P tradeoff would be desirable. 
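
The two precision-reducing operations mentioned above can be sketched as follows. This is only an illustration of heuristic truncation, not the optimal scheme developed in this paper; the function names, the bin width of 10, and the sample identifier are assumptions introduced here.

```python
import math

def quantize_weight(weight, bin_width=10):
    """Map an integer weight to a range label such as '90-100'."""
    low = (weight // bin_width) * bin_width
    return f"{low}-{low + bin_width}"

def mask_ssn(ssn, keep=4):
    """Keep only the last `keep` digits of an identifier string."""
    digits = [c for c in ssn if c.isdigit()]
    return "".join(digits[-keep:])

def precision_bits(num_distinct_outputs):
    """Bits needed to index the distinct values that can be released."""
    return math.log2(num_distinct_outputs)

weights = [93, 97, 104, 108, 111]
released = [quantize_weight(w) for w in weights]  # three distinct ranges
masked = mask_ssn("123-45-6789")                  # placeholder identifier
```

Releasing four digits instead of nine reduces the output precision from about `precision_bits(10**9)` to `precision_bits(10**4)` bits per entry, which is the kind of disclosure-rate reduction that the framework aims to specify formally rather than heuristically.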

In [1] the rate-distortion-equivocation (RDE) tradeoff for a simple source model was presented. We 
translate this formalism to the U-P problem and develop a framework that allows us to model generic data 
sources, including multi-dimensional databases and data streams [2], develop abstract utility and privacy 
metrics, and quantify the fundamental U-P tradeoff bounds. We then present a sanitization scheme that 
achieves the U-P tradeoff region and demonstrate the application of this scheme for both numerical and
categorical examples. Noting that correlation available to the user/adversary can be internal (i.e. between 
variables within a database) or external (with variables that are outside the database but accessible to the 
user/adversary), [3]-[5] have shown that external knowledge can be very powerful in the privacy context. 
We address this challenge in our framework via a model for side information. Our theorem in this context 
reported previously in [6] is presented with the full proof here. 

Finally, we demonstrate our framework with two crucial and practically relevant examples: categorical 
and numerical databases. Our examples demonstrate two fundamental aspects of our framework: (i) how 
statistical models for the data and U-P metrics reveal the appropriate distortion and suppression of
data to achieve both privacy and utility guarantees; and (ii) how knowledge of source statistics enables 
determining the U-P optimal sanitization mechanism, and therefore, the largest U-P tradeoff region. 

The paper is organized as follows. In Section II, we briefly summarize the state of the art in database
privacy research. In Section III, we motivate the need for an information-theoretic analysis and present
the intuition behind our analytical framework. In Section IV, we present an abstract model and metrics
for structured data sources such as databases. We develop our primary analytical framework in Section
V and illustrate our results in Section VI. We close with concluding remarks in Section VII.

II. Related Work 

The problem of privacy in databases has a long and rich history dating back at least to the 1970s, 
and space restrictions preclude any attempt to do full justice to the different approaches that have been 
considered along the way. We divide the existing work into two categories, heuristic and theoretical 
techniques, and outline the major milestones from these categories for comparison. 

The earliest attempts at systematic privacy were in the area of census data publication where data was 
required to be made public but without leaking individuals' information. A number of ad hoc techniques 
such as sub-sampling, aggregation, and suppression were explored (e.g., [7], [8] and the references 
therein). The first formal definition of privacy was k-anonymity by Sweeney [3]. However, k-anonymity
was found to be inadequate as it only protects against identity disclosure but not attribute-based disclosure
and was extended with t-closeness [9] and l-diversity [10]. All these techniques have proved to be
non-universal as they are robust only against limited adversaries. Heuristic techniques for privacy in data
mining have focused on using mutual information-based privacy metrics [11].

The first universal formalism was proposed in differential privacy (DP) [4] (see the survey in [12] for 
a detailed history of the field). In this model, the privacy of an individual in a database is defined as 
a bound on the ability of any adversary to accurately detect whether that individual's data belongs to
the database or not. The authors of [4] also show that Laplacian distributed additive noise with appropriately chosen
parameters suffices to sanitize numerical data to achieve differential privacy. The concept of DP is strictly 
stronger than our definition of privacy, which is based on Shannon entropy. However, our model seems 
more intuitively accessible and suited to many application domains where strict anonymity is not the 
requirement. For example, in many wellness databases the presence of the record of an individual is not 
a secret but that individual's disease status is. Our sanitization approach applies to both numerical and 
categorical data whereas DP, while being a very popular model for privacy, appears limited to numerical 
data. Furthermore, the loss of utility from DP-based sanitization can be significant [13]. There has been 
some work pointing out the loss of utility due to privacy mechanisms for specific applications [14]. 

More generally, a rigorous model for privacy-utility tradeoffs with a method to achieve all the optimal 
points has remained open and is the subject of this paper. The use of information theoretic tools for 
privacy and related problems is relatively sparse. [1] analyzed a simple two variable model using rate 
distortion theory with equivocation constraints, which is the prime motivation for this work. In addition, 
there has been recent work comparing differential privacy guarantees with Rényi entropy [15] and Shannon
entropy [16]. 

III. Motivation and Background 

The information-theoretic approach to database privacy involves two steps: the first is the data modeling
step and the second is deriving the mathematical formalism for sanitization. Before we introduce
our formal model and abstractions, we first present an intuitive understanding and motivation for our 
approaches below. 

A. Motivation: Statistical Model 

Our work is based on the observation that large datasets (including databases) have a distributional 
basis; i.e., there exists an underlying (sometimes implicit) statistical model for the data. Even in the case 
of data mining where only one or a few instances of the dataset are ever available, the use of correlations 
between attributes uses an implicit distributional assumption about the dataset. We explicitly model the
data as being generated by a source with a finite or infinite alphabet and a known distribution. Each row 
of the database is a collection of correlated attributes (of an individual) that belongs to the alphabet of 
the source and is generated according to the probability of occurrence of that letter (of the alphabet). 
Our statistical model for databases is also motivated by the fact that while the attributes of an individual
may be correlated (e.g., between the weight and cancer attributes in Fig. 1), the records of a large number
of individuals are generally independent or weakly correlated with each other. We thus model the database 
as a collection of n observations generated by a memoryless source whose outputs are independent and 
identically distributed (i.i.d.). 

Statistically, with a large number n of i.i.d. samples collected from a source, the data collected can 
be viewed as typical, i.e., it follows the strong law of large numbers (SLLN) [17, Ch. 11]. The SLLN 
implies that the absolute difference between the empirical distribution (obtained from the observations) 
and the actual distribution of each letter of the source alphabet decreases with n, i.e., the samples (letters 
from the source alphabet) in the database will be represented proportional to their actual probabilities. 
This implies that for all practical purposes the empirical distribution obtained from a large dataset can 
be assumed to be the statistical distribution of the idealized source for our model and the approximation 
gets better as n grows. 
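
The convergence of the empirical distribution to the source distribution can be checked with a small simulation. The sketch below is only illustrative: the three-letter alphabet, its probabilities, and the use of total variation distance as the measure of mismatch are assumptions made here.

```python
import random
from collections import Counter

def empirical_tv_distance(true_pmf, n, rng):
    """Total variation distance between the empirical pmf of n i.i.d.
    samples and the true pmf they were drawn from."""
    letters = list(true_pmf)
    weights = [true_pmf[a] for a in letters]
    sample = rng.choices(letters, weights=weights, k=n)
    counts = Counter(sample)
    return 0.5 * sum(abs(counts[a] / n - true_pmf[a]) for a in letters)

# Hypothetical source alphabet and distribution.
true_pmf = {"a": 0.6, "b": 0.3, "c": 0.1}
rng = random.Random(0)  # fixed seed for repeatability

tv_small = empirical_tv_distance(true_pmf, 100, rng)
tv_large = empirical_tv_distance(true_pmf, 100_000, rng)
```

For the larger database the mismatch `tv_large` is typically on the order of a few thousandths, so the empirical distribution can stand in for the source distribution.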

Our measures for utility and privacy capture this statistical model. In particular, we quantify privacy 
using conditional entropy where the conditioning on the published (revealed) data captures the average 
uncertainty about the source (specifically, the private attributes of the source) post-sanitization. Our utility 
measure similarly is averaged over the source distribution. 

Intuitively, privacy is about maintaining uncertainty about information that is not explicitly disclosed. 
The common notion of a person being undetectable in a group as in [3] or an individual record remaining 
undetectable in a dataset [4] captures one flavor of such uncertainty. More generally, the uncertainty about 
a piece of undisclosed information is related to its information content. Our approach focuses on the 
information content of every sample of the source and sanitizes it in proportion to its likelihood in the 
database. This, in turn, ensures that low probability/high information samples (outliers) are suppressed 
or heavily distorted whereas the high probability (frequent flier) samples are distorted only slightly. 
Outlier data, if released without sanitization, can leak a lot of information to the adversary about those 
individuals (e.g. individuals older than a hundred years); on the other hand, for individuals represented 
by high probability samples either the adversary already has a lot of information about them or they are 
sufficiently indistinct due to their high occurrence in the data, thereby allowing smaller distortion. 
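
The link between likelihood and information content can be made concrete through the self-information -log2 p(x) of each source letter. The sketch below uses a hypothetical age distribution (the brackets and probabilities are made up) to show that outliers carry far more information than frequent samples.

```python
import math

# Hypothetical distribution over age brackets; illustrative numbers only.
age_pmf = {"18-40": 0.45, "41-65": 0.40, "66-99": 0.1499, "100+": 0.0001}

# Self-information (in bits) of observing each bracket.
self_information = {age: -math.log2(p) for age, p in age_pmf.items()}

# Brackets ordered from most to least revealing.
most_revealing_first = sorted(self_information,
                              key=self_information.get, reverse=True)
```

Here the "100+" bracket carries more than 13 bits while the most common bracket carries just over 1 bit, which matches the intuition that outliers need to be suppressed or heavily distorted.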

As we show formally in the sequel, our approach and solution for categorical databases captures a 
critical aspect of the privacy challenge, namely, in suppressing the high information (low probability 
outlier samples) and distorting all others (up to the desired utility/distortion level), the database provides 
uncertainty (for that distortion level) for all samples of the data. Thus, our statistical privacy measure 
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captures the characteristics of the underlying data model. 

It is crucial to note that distortion does not only imply distance-based measures. The distortion measure 
can be chosen to preserve any desired function, deterministic or probabilistic, of the attributes (e.g., 
aggregate statistics). Our aim is to ensure that sensitive data is protected by randomizing the public 
(non-sensitive) data in a rigorous and well-defined manner such that: (a) it still preserves some measure 
of the original public data (e.g., K-L divergence, Euclidean distance, Hamming distortion, etc.); and (b) 
provides some measure of privacy for the sensitive data that can be inferred from the revealed data. In this 
context, distortion is a term that makes precise a measure of change between the original non-sensitive 
data and its revealed version; appropriate measures depend on the data type, statistics, and the application 
as illustrated in the sequel. 

At its crux, our proposed sanitization process is about determining the statistics of the output (database) 
that achieve a desired level of utility and privacy and about deciding which input values to perturb and 
how to probabilistically perturb them. Since the output statistics depend on the sanitization process, for
the i.i.d. source model considered here, mathematically the problem reduces to finding the input to output 
symbol-wise transition probability. 
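
A symbol-wise transition probability can be represented as a table q(x_hat | x) that is applied independently to every row. The following sketch assumes a single binary public attribute and a hypothetical transition table; it is not the optimal mapping derived later, only an illustration of how such a table determines the output statistics and the expected Hamming distortion.

```python
import random

def output_pmf(source_pmf, channel):
    """Output distribution induced by applying channel[x][x_hat] to the source."""
    out = {}
    for x, p in source_pmf.items():
        for x_hat, q in channel[x].items():
            out[x_hat] = out.get(x_hat, 0.0) + p * q
    return out

def expected_hamming_distortion(source_pmf, channel):
    """Probability that a released symbol differs from the original symbol."""
    return sum(p * q
               for x, p in source_pmf.items()
               for x_hat, q in channel[x].items() if x_hat != x)

def sanitize(rows, channel, rng):
    """Perturb each row independently according to the transition table."""
    sanitized = []
    for x in rows:
        outputs = list(channel[x])
        weights = [channel[x][o] for o in outputs]
        sanitized.append(rng.choices(outputs, weights=weights, k=1)[0])
    return sanitized

# Hypothetical source statistics and transition table.
source_pmf = {"M": 0.5, "F": 0.5}
channel = {"M": {"M": 1.0}, "F": {"M": 0.2, "F": 0.8}}

pmf_out = output_pmf(source_pmf, channel)
distortion = expected_hamming_distortion(source_pmf, channel)
released = sanitize(["M", "F", "F", "M"], channel, random.Random(0))
```

With this table, 60% of the released entries are 'M' on average and one entry in ten is altered; choosing the table to meet given utility and privacy targets is the design problem addressed in the sequel.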

B. Background: Rate-distortion Theory 

In addition to a statistical model for large data sets, we also introduce an abstract formulation for the 
sanitization process, which is based on the theory of rate-distortion. We provide some intuition for the 
two steps involved in information-theoretic sanitization, namely encoding at the database and decoding 
at the data user. 

For the purposes of privacy modeling the attributes about any individual in a database fall in two 
categories: public attributes that can be revealed and private attributes that need to be kept hidden, 
respectively. An attribute can be both public and private at the same time. The attributes of any individual 
are correlated; this implies that if the public attributes are revealed as is, information about the private 
attributes can be inferred by the user using a correlation model. Thus, ensuring privacy of the private 
attributes (also referred to as hidden attributes in the sequel) requires modifying/sanitizing/distorting the 
public attributes. However, the public attributes have a utility constraint that limits the distortion, and 
therefore, the privacy that can be guaranteed to the private attributes. 

Our approach is to determine the optimal sanitization, i.e., a mapping which guarantees the maximal 
privacy for the private attributes for the desired level of utility for the public attributes, among the set 
of all possible mappings that transform the public attributes of a database. We use the terms encoding 
and decoding to denote this mapping at the data publisher end and the user end respectively. A database
instance is an n-realization of a random source (the source is a vector when the number of attributes K > 1)
and can be viewed as a point in an n-dimensional space (see Fig. 2). The set of all possible databases
(n-length source sequences) that can be generated using the source statistics (probability distribution) lie 
in this space. 

Our choice of utility metric is a measure of average 'closeness' between the original and revealed 
database public attributes via a distortion requirement D. Thus the output of sanitization will be another 
database (another point in the same n-dimensional space) within a ball of 'distance' nD. We seek to 
determine a set of some $M = 2^{nR}$ output databases that 'cover' the space, i.e., given any input database
instance there exists at least one sanitized database within bounded 'distance' nD as shown in Fig. 2.
Note that the sanitized database may be in a subspace of the entire space because only the public attributes 
are sanitized and the utility requirement is only in this subspace. 
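
To give a feel for how the number M = 2^{nR} of output databases depends on the allowed distortion, the sketch below evaluates the classical rate-distortion function of a Bernoulli(p) source under Hamming distortion, R(D) = h(p) - h(D) for 0 <= D < min(p, 1 - p) and zero otherwise. This is a textbook result without any privacy constraint and is used here purely as an illustration; the function names are introduced for this example.

```python
import math

def binary_entropy(q):
    """Binary entropy h(q) in bits, with h(0) = h(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def bernoulli_rate_distortion(p, d):
    """R(D) in bits per entry for a Bernoulli(p) source, Hamming distortion."""
    if d >= min(p, 1 - p):
        return 0.0
    return binary_entropy(p) - binary_entropy(d)

def log2_num_output_databases(n, p, d):
    """log2 of M = 2^(nR), i.e. nR, for a database with n binary entries."""
    return n * bernoulli_rate_distortion(p, d)

rate_exact = bernoulli_rate_distortion(0.5, 0.0)   # no distortion allowed
rate_lossy = bernoulli_rate_distortion(0.5, 0.11)  # roughly half a bit
```

Allowing about 11% of the entries of a uniform binary attribute to change roughly halves the exponent nR, so far fewer output databases are needed to cover the space.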

In information theory such a distortion-constrained encoding is referred to as quantization or
compression. Furthermore, the mapping is referred to as vector quantization because the compression is
of an n-dimensional space and can be achieved in practice using clustering algorithms. In addition to 
a distortion (utility) constraint, our privacy constraint also requires that the "leakage" (i.e. the loss of 
uncertainty) about the private attributes via correlation from the sanitized database is bounded. The set 
of M source-sanitized database pairs is chosen to satisfy both distortion and leakage constraints. The 
database user that receives the sanitized database may have other side-information (s.i.) about which the 
encoder is either statistically informed (i.e., only the statistics of s.i. known) or informed (knows s.i. a 
priori). The decoder can combine the sanitized database published by the encoder and the s.i. to recreate 
the final reconstructed database. 

Obtaining the U-P tradeoff region involves two parts: the first is a proof of existence of a mapping, 
called a converse or outer bounds in information theory, and the second is an achievable scheme (inner
bounds) that involves constructing a mapping (called a code). Mathematically, the converse bounds the 
maximal privacy that can be achieved for a desired utility over the space of all feasible mappings, and the 
achievable scheme determines the input to output probabilistic mapping and reveals the minimal privacy 
achievable for a desired distortion. When the inner and outer bounds meet, the constructive scheme is 
tight and achieves the entire U-P tradeoff, often the case for tractable distributions such as Gaussian, 
Laplacian, and arbitrary discrete sources. 

It is important to note that our assumption of knowledge of the source statistics at all involved
parties does not limit the applicability of the framework for the following reasons: (i) the statistics
for large data can often be sampled reliably from the data collected; (ii) knowledge of statistics alone is
insufficient to generate the actual database at the user; and (iii) most importantly, the statistical knowledge
enables us to find the optimal input to output probabilistic mapping (i.e., a perturbation matched to the
source statistics) that satisfies specific utility and privacy measures. The power of our approach is that
it completely eliminates signal-perturbation mismatch problems as observed in privacy-preserving data
mining solutions by Kargupta et al. [18]; furthermore, the irreversibility of the quantization process implies
that the suppressed or distorted data cannot be recovered despite knowledge of the actual statistics. In the
following section, we formalize these notions and present a rigorous analysis.

Fig. 2. Space of all database realizations and the quantized databases.



IV. Model and Metrics

A. Model for Databases

A database $\mathcal{D}$ is a matrix whose rows and columns represent the individual entries and their attributes,
respectively. For example, the attributes of a healthcare database can include name, address, SSN, gender,
and a collection of possible medical information. The attributes that directly give away information such
as name and SSN are typically considered private data.

Model: Our proposed model focuses on large databases with K attributes per entry. Let $\mathcal{X}_k$, for all
$k \in \mathcal{K} = \{1, 2, \ldots, K\}$, and $\mathcal{Z}$ be finite sets. Let $X_k \in \mathcal{X}_k$ be a random variable denoting the k-th
attribute, $k = 1, 2, \ldots, K$, and let $X_{\mathcal{K}} = (X_1, X_2, \ldots, X_K)$. A database d with n rows is a sequence of
n independent observations from the distribution

$$p_{X_{\mathcal{K}}}(x_{\mathcal{K}}) = p_{X_1 X_2 \ldots X_K}(x_1, x_2, \ldots, x_K) \qquad (1)$$

which is assumed to be known to both the designers and users of the database. Our simplifying assumption
of row independence holds generally in large databases (but not always) as correlation typically arises
across attributes and can be ignored across entries given the size of the database. We write
$X_{\mathcal{K}}^n = (X_1^n, X_2^n, \ldots, X_K^n)$ to denote the n independent and identically distributed (i.i.d.) observations of $X_{\mathcal{K}}$.

The joint distribution in (1) models the fact that the attributes corresponding to an individual entry are
correlated in general and consequently can reveal information about one another. 

Public and private attributes: We consider a general model in which some attributes need to be kept
private while the source can reveal a function of some or all of the attributes. We write $\mathcal{K}_h$ and $\mathcal{K}_r$ to
denote the sets of private (subscript h for hidden) and public (subscript r for revealed) attributes, respectively,
such that $\mathcal{K}_r \cup \mathcal{K}_h = \mathcal{K} = \{1, 2, \ldots, K\}$. We further denote the corresponding collections of public and
private attributes by $X_{\mathcal{K}_r} = \{X_k\}_{k \in \mathcal{K}_r}$ and $X_{\mathcal{K}_h} = \{X_k\}_{k \in \mathcal{K}_h}$, respectively. More generally, we write
$X_{S_h} = \{X_k : k \in S_h \subseteq \mathcal{K}_h\}$ and $X_{S_r} = \{X_k : k \in S_r \subseteq \mathcal{K}_r\}$ to denote subsets of private and public
attributes, respectively.

Our notation allows for an attribute to be both public and private; this is to account for the fact that 
a database may need to reveal a function of an attribute while keeping the attribute itself private. In 
general, a database can choose to keep public (or private) one or more attributes (K > 1). Irrespective of 
the number of private attributes, a non-zero utility results only when the database reveals an appropriate 
function of some or all of its attributes. 

Revealed attributes and side information: As discussed in the previous section, the public attributes 
are in general sanitized/distorted prior to being revealed in order to reduce possible inferences about 
the private attributes. We denote the resulting revealed attributes as $\hat{X}_{\mathcal{K}_r} = \{\hat{X}_k\}_{k \in \mathcal{K}_r}$. In addition to
the revealed information, a user of a database can have access to correlated side information from other
information sources. We model the side information (s.i.) as an n-length sequence $Z^n = (Z_1, Z_2, \ldots, Z_n)$,
$Z_i \in \mathcal{Z}$ for all i, which is correlated with the database entries via a joint distribution $p_{X_{\mathcal{K}} Z}(x_{\mathcal{K}}, z)$.

Reconstructed database: The final reconstructed database at the user will be either a database of 
revealed public attributes (when no s.i. is available) or a database generated from a combination of the 
revealed public attributes and the side information (when s.i. is available). 

B. Metrics: The Privacy and Utility Principle 

Even though utility and privacy measures tend to be specific to the application, there is a fundamental 
principle that unifies all these measures in the abstract domain. A user perceives the utility of a perturbed 
database to be high as long as the response is similar to the response of the unperturbed database;
thus, the utility is highest for an unperturbed database and goes to zero when the perturbed database is
completely unrelated to the original database. Accordingly, our utility metric is an appropriately chosen 
average 'distance' function between the original and the perturbed databases. 

Privacy, on the other hand, is maximized when the perturbed response is completely independent of the 
data. Our privacy metric measures the difficulty of extracting any private information from the response, 
i.e., the amount of uncertainty or equivocation about the private attributes given the response. One could 
alternately quantify the privacy loss from revealing data as the mutual information between the private 
attributes and the response; mutual information is typically used to quantify leakage (or secrecy) for 
continuous valued data. 

C. Utility and Privacy Aware Encoding 

Since database sanitization is traditionally the process of distorting the data to achieve some measure 
of privacy, it is a problem of mapping a database to a different one subject to specific utility and privacy 
requirements. 

Mapping: Our notation below relies on this abstraction. Let $\mathcal{X}_k$, $k \in \mathcal{K}$, and $\mathcal{Z}$ be as above and let
$\hat{\mathcal{X}}_j$ be additional finite sets for all $j \in \mathcal{K}_r$. Recall that a database d with n rows is an instantiation of
$X_{\mathcal{K}}^n$. Thus, we will henceforth refer to a real database d as an input database and to the corresponding
sanitized database (SDB) $d_s$ as an output database. When the user has access to side information, the
reconstructed database d' at the user will in general be different from the output database.

Our coding scheme consists of an encoder $F_E$, which is a mapping from the set of all input databases
(i.e., all databases d allowable by the underlying distribution) to a set of indices $\mathcal{J} = \{1, 2, \ldots, M\}$ and
an associated table of output databases (each of which is a $d_s$), given by

$$F_E : (\mathcal{X}_1^n \times \cdots \times \mathcal{X}_k^n)_{k \in \mathcal{K}_{enc}} \to \mathcal{J} = \{\mathrm{SDB}_k\}_{k=1}^{M} \qquad (2)$$

where $\mathcal{K}_r \subseteq \mathcal{K}_{enc} \subseteq \mathcal{K}$ and M is the number of output (sanitized) databases created from the set of all
input databases. To allow for the case where an attribute can be both public and private, we allow the
encoding $F_E$ in (2) to include both public and private attributes. A user with a view of the SDB (i.e.,
an index $j \in \mathcal{J}$) and with access to side information $Z^n$, whose entries $Z_i$, $i = 1, 2, \ldots, n$, take values
in the alphabet $\mathcal{Z}$, reconstructs the database d' via the mapping

$$F_D : \mathcal{J} \times \mathcal{Z}^n \to \prod_{k \in \mathcal{K}_r} \hat{\mathcal{X}}_k^n. \qquad (3)$$

The encoding and decoding are assumed known at both parties.
Utility: Relying on a distance-based utility principle, we model the utility u via the requirement that
the average distortion of the public variables is upper bounded, for each $\epsilon > 0$ and all sufficiently large
n, as

$$u = E\left[\frac{1}{n} \sum_{i=1}^{n} \rho\left(X_{\mathcal{K}_r,i}, \hat{X}_{\mathcal{K}_r,i}\right)\right] \le D + \epsilon, \qquad (4)$$

where $\rho(\cdot, \cdot)$ denotes a distortion function, E is the expectation over the joint distribution of $(X_{\mathcal{K}_r}, \hat{X}_{\mathcal{K}_r})$,
and the subscript i denotes the i-th entry of the database. Examples of distortion functions include the
Euclidean distance for Gaussian distributions, the Hamming distance for binary input and output databases,
and the Kullback-Leibler (K-L) divergence. We assume that D takes values in a closed compact set to
ensure that the maximal and minimal distortions are finite and all possible distortion values between
these extremes can be achieved.
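
The per-entry average inside the utility constraint can be computed directly for a single pair of original and sanitized columns. The sketch below is an empirical average for one database instance rather than the expectation over the source distribution, and the sample columns are hypothetical; it only illustrates how different distortion functions plug into the same average.

```python
def hamming(x, x_hat):
    """Hamming distortion: 0 if the symbols agree, 1 otherwise."""
    return 0.0 if x == x_hat else 1.0

def squared_error(x, x_hat):
    """Squared Euclidean distortion for numerical attributes."""
    return (x - x_hat) ** 2

def average_distortion(original, sanitized, rho):
    """Per-entry average of rho over aligned rows of the two columns."""
    if len(original) != len(sanitized):
        raise ValueError("columns must have the same number of rows")
    return sum(rho(x, x_hat)
               for x, x_hat in zip(original, sanitized)) / len(original)

# Hypothetical categorical and numerical columns with their sanitized versions.
gender, gender_hat = ["M", "F", "F", "M"], ["M", "M", "F", "M"]
weight, weight_hat = [95, 102, 88], [90, 100, 90]

d_categorical = average_distortion(gender, gender_hat, hamming)
d_numerical = average_distortion(weight, weight_hat, squared_error)
```

A sanitized column would meet a target D when the resulting average does not exceed D (up to the slack allowed by the constraint); here one changed entry out of four gives a Hamming distortion of 0.25.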

Privacy: We quantify the equivocation e of all the private variables using entropy as

$$e = \frac{1}{n} H\left(X_{\mathcal{K}_h}^n \mid J, Z^n\right) \ge E - \epsilon. \qquad (5)$$

Analogous to (5), we can quantify the privacy leakage l using mutual information as

$$l = \frac{1}{n} I\left(X_{\mathcal{K}_h}^n; J, Z^n\right) \le L + \epsilon. \qquad (6)$$

Remark 1: The case in which side information is not available at the user is obtained by simply setting
$Z^n = 0$ in (3) and (5).

We shall henceforth focus on using equivocation as a privacy metric except for the case where the
source is modeled as continuous valued data since, unlike differential entropy, mutual information is
always non-negative. From (5), we have $H(X_{\mathcal{K}_h} \mid X_{\mathcal{K}_r}, Z) \le E \le H(X_{\mathcal{K}_h} \mid Z) \le H(X_{\mathcal{K}_h})$, where the
upper bound on the equivocation results when the private and public attributes (and side information) are
uncorrelated and the lower bound results when the public attributes (and side information) completely
preserve the correlation between the public and private attributes. Note that the leakage can be analogously
bounded as $0 \le I(X_{\mathcal{K}_h}; Z) \le L \le I(X_{\mathcal{K}_h}; X_{\mathcal{K}_r}, Z)$.
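
The range of achievable equivocation can be checked on a toy example without side information. The sketch below assumes a hypothetical joint distribution of one public and one private attribute and a hypothetical sanitizing table q(x_hat | x_r); it verifies numerically that the equivocation given the sanitized attribute lies between the equivocation given the true public attribute and the prior entropy.

```python
import math

def cond_entropy(joint):
    """H(second | first) in bits for a joint pmf {(first, second): prob}."""
    marginal = {}
    for (a, _), p in joint.items():
        marginal[a] = marginal.get(a, 0.0) + p
    return -sum(p * math.log2(p / marginal[a])
                for (a, _), p in joint.items() if p > 0)

# Hypothetical joint pmf of (public X_r, private X_h); illustrative only.
joint_rh = {("M", "yes"): 0.05, ("M", "no"): 0.45,
            ("F", "yes"): 0.20, ("F", "no"): 0.30}
# Hypothetical sanitizer: releases 'F' as 'M' with probability 0.5.
channel = {"M": {"M": 1.0}, "F": {"M": 0.5, "F": 0.5}}

# Joint pmf of (sanitized public, private); the sanitizer sees only X_r.
joint_sh = {}
for (x_r, x_h), p in joint_rh.items():
    for x_hat, q in channel[x_r].items():
        joint_sh[(x_hat, x_h)] = joint_sh.get((x_hat, x_h), 0.0) + p * q

# Marginal of the private attribute, conditioned on a constant placeholder.
marginal_h = {}
for (_, x_h), p in joint_rh.items():
    marginal_h[x_h] = marginal_h.get(x_h, 0.0) + p

h_given_true = cond_entropy(joint_rh)       # lower end: public data as is
h_given_sanitized = cond_entropy(joint_sh)  # equivocation after sanitizing
h_prior = cond_entropy({("*", x_h): p for x_h, p in marginal_h.items()})
```

In this example the three values are roughly 0.72, 0.78, and 0.81 bits, so the sanitizer recovers part, but not all, of the uncertainty lost by revealing the public attribute exactly.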

The mappings in (2) and (3) ensure that d is mapped to d' such that the U-P constraints in (4) and (5)
are met. The formalism in (1)-(6) is analogous to lossy compression in that a source database is mapped
to one of M quantized databases that are designed a priori. For a chosen encoding, a database realization
is mapped to the appropriate quantized database, subject to (4) and (5). It suffices to communicate the
index J of the resulting quantized database as formalized in (2) to the user. This index, in conjunction
with side information, if any, enables a reconstruction at the user as in (3). Note that the mappings in (2)
and (3), i.e., lossy compression with privacy guarantees, ensure that for any D > 0, the user can only
reconstruct the database $d' = \hat{X}_{\mathcal{K}_r}^n$, formally a function $f(J, Z^n)$, and not $d = X_{\mathcal{K}}^n$ itself.

The utility and privacy metrics in © and d5j capture the statistical nature of the problem, i.e., the 
fact that the entries of the database statistically mirror the distribution CD- Thus, both metrics represent 
averages across all database instantiations d, and hence, (assuming stationarity and large n) over the 
sample space of X]q thereby quantifying the average distortion (utility) and equivocation (privacy) 
achievable per entry. 

Remark 2: In general, a database may need to satisfy utility constraints for any collection of subsets
$\mathcal{S}_r^{(l)} \subseteq \mathcal{K}_r$ of attributes and privacy constraints on all possible subsets of private
attributes $\mathcal{S}_h^{(m)}$, $m = 1, 2, \ldots, L_p$, $1 \le L_p \le 2^{|\mathcal{K}_h|} - 1$, where $|\mathcal{K}_h|$ is
the cardinality of $\mathcal{K}_h$. For ease of exposition and without
loss of generality, we develop the results for the case of utility and privacy constraints on the set of all
public and private attributes. The results can be generalized in a straightforward manner to constraints
on arbitrary subsets.

V. Utility-Privacy Tradeoffs 

Mapping utility to distortion and privacy to information uncertainty via entropy (or leakage via mutual 
information) leads to the following definition of the U-P tradeoff region. 

Definition 1: The U-P tradeoff region $\mathcal{T}$ is the set of all feasible U-P tuples $(D, E)$ for which there
exists a coding scheme $(F_E, F_D)$ given by (2) and (3), respectively, with parameters $(n, M, u, e)$ satisfying
the constraints in (4) and (5).

While the U-P tradeoff region in Definition 1 can be determined for specific database examples, one
has to, in general, resort to numerical techniques to solve the optimization problem [19]. To obtain closed
form solutions that define the set of all tradeoff points and identify the optimal encoding schemes, we
exploit the rich set of techniques from rate distortion theory with and without equivocation constraints.
To this end, we study a more general problem of RDE by introducing an additional rate constraint
$M \le 2^{n(R+\epsilon)}$ which bounds the number of quantized SDBs in (2). Besides enabling the use of known
rate-distortion techniques, the rate constraint also has an operational significance. For a desired level of
accuracy (utility) $D$, the rate $R$ is the precision required on average (over $X_{\mathcal{K}}$) to achieve it. We now
define the achievable RDE region as follows.

Definition 2: The RDE region $\mathcal{R}_{RDE}$ is the set of all tuples $(R, D, E)$ for which there exists a coding
scheme given by (2) and (3) with parameters $(n, M, u, e)$ satisfying the constraints in (4), (5), and on
the rate. Within this region, the set $\mathcal{R}_{D-E}$ of all feasible distortion-equivocation tuples $(D, E)$ is
defined as

$$\mathcal{R}_{D-E} \equiv \left\{(D, E) : (R, D, E) \in \mathcal{R}_{RDE},\ R \ge 0\right\}. \tag{7}$$

Fig. 3. (a) Rate Distortion Equivocation Region [1]; (b) Utility-Privacy Tradeoff Region.

The RDE problem differs from the distortion-equivocation problem in including a constraint on the
precision of the public variables in addition to the equivocation constraint on the private data in both
problems. Thus, in the RDE problem, for a desired utility $D$, one obtains the set of all rate-equivocation
tradeoff points $(R, E)$, and therefore, over all distortion choices, the resulting region contains the set of
all $(D, E)$ pairs. From Definitions 1 and 2 we thus have the following proposition.

Proposition 1: $\mathcal{T} = \mathcal{R}_{D-E}$.

Proposition 1 is captured pictorially in Fig. 3(b). The functions $R(D, E)$ and $\Gamma(D)$ in Fig. 3 capture
the rate and privacy boundaries of the region and are the minimal rate and maximal privacy achievable,
respectively, for a given distortion $D$.

The power of Proposition 1 is that it allows us to study the larger problem of database U-P tradeoffs
in terms of a relatively familiar problem of source coding with additional privacy constraints. Our result
shows the tradeoff between utility (distortion), privacy (equivocation), and precision (rate): fixing the
value of any one determines the set of operating points for the other two; for example, fixing the utility
(distortion $D$) quantifies the set of all achievable privacy-precision tuples $(E, R)$.

For the case of no side information, i.e., for the problem in (1)-(5) with $Z^n = 0$, the RDE region
was obtained by Yamamoto [1] for $K_r = K_h = 1$ and $\mathcal{K}_r \cap \mathcal{K}_h = \emptyset$. We henceforth refer to this as
an uninformed case, since neither the encoder (database) nor the decoder (user) has access to external
information sources. We summarize the result below in the context of a utility-privacy tradeoff region.
We first summarize the intuition behind the results and the encoding scheme achieving it.

In general, to obtain the set of all achievable RDE tuples, one follows two steps: the first is to obtain 
(outer) bounds for an $(n, M, u, e)$ code on the rate and equivocation required to decode reliably with a
distortion D (vanishing error probability in decoding for a bounded distortion D); the second step is a 
constructive coding scheme for which one determines the inner bounds on rate and equivocation. The 
set of all (R, D, E) tuples is achievable when the two bounds meet. The achievable RDE region was 
developed in [1, Appendix] for the problem in |2] Focusing on the set of all RDE tradeoff points, we 
restate the results in [1, Appendix] as follows. 

Proposition 2: Given a database with public, private, and reconstructed variables $X_{\mathcal{K}_r}$, $X_{\mathcal{K}_h}$,
and $\hat{X}_{\mathcal{K}_r}$, respectively, and $Z = 0$, for a fixed target distortion $D$, the set of achievable $(R, E)$
tuples satisfies

$$R \ge R_U(D) \equiv I\left(X_{\mathcal{K}_r} X_{\mathcal{K}_h}; \hat{X}_{\mathcal{K}_r}\right) \tag{8a}$$

$$E \le E_U(D) \equiv H\left(X_{\mathcal{K}_h} \mid \hat{X}_{\mathcal{K}_r}\right) \tag{8b}$$

for some $p(x_{\mathcal{K}_h}, x_{\mathcal{K}_r}, \hat{x}_{\mathcal{K}_r})$ such that $E\left[d(X_{\mathcal{K}_r}, \hat{X}_{\mathcal{K}_r})\right] \le D$.

Remark 3: The distribution $p(x_{\mathcal{K}_h}, x_{\mathcal{K}_r}, \hat{x}_{\mathcal{K}_r})$ allows for two cases, one in which both the
public and private attributes are used to encode (e.g., medical) and the other in which only the public
(e.g., census) attributes are used. For the latter case in which the private attributes are only implicitly
used (via the correlation), the distribution simplifies as
$p(x_{\mathcal{K}_h}, x_{\mathcal{K}_r})\, p(\hat{x}_{\mathcal{K}_r} \mid x_{\mathcal{K}_r})$, i.e., the variables satisfy the Markov
chain $X_{\mathcal{K}_h} - X_{\mathcal{K}_r} - \hat{X}_{\mathcal{K}_r}$.

Theorem 1: The U-P tradeoff region for a database problem defined by (1)-(6) and with $Z^n = 0$ is
the set of all $(E, D)$ such that for every choice of distortion $D \in \mathcal{D}$ that is achievable by a quantization
scheme with a distribution $p(x_{\mathcal{K}_h}, x_{\mathcal{K}_r}, \hat{x}_{\mathcal{K}_r})$, the privacy achievable is given by $E_U(D)$
in (8b) (for which a rate of $R_U(D)$ in (8a) is required).

The set of all RDE tuples in (8) defines the region $\mathcal{R}_{RDE}$. The functions in Fig. 3 specifying the
boundaries of this region are given as follows: $R(D, E)$, which is the minimal rate required for any
choice of distortion $D$, is given by

$$R(D, E) = R(D, E^*) \equiv \min_{p(x_{\mathcal{K}_h}, x_{\mathcal{K}_r}, \hat{x}_{\mathcal{K}_r})} R_U(D) \tag{9}$$

where $E^* = E_U(D)|_{p^*}$ is evaluated at $p^*$, the argument of the optimization in (9), and $\Gamma(D)$, which is
the maximal equivocation achievable for a desired distortion $D$, is given by

$$\Gamma(D) \equiv \max_{p(x_{\mathcal{K}_h}, x_{\mathcal{K}_r}, \hat{x}_{\mathcal{K}_r})} E_U(D). \tag{10}$$

Remark 4: In general, the functions $R(D, E)$ and $\Gamma(D)$ may not be optimized by the same distribution
$p(x_{\mathcal{K}_h}, x_{\mathcal{K}_r}, \hat{x}_{\mathcal{K}_r})$, i.e., $R(D, E)$ may be minimal for an $E = E^* < \Gamma(D)$. This implies
that in general the minimal rate encoding scheme is not necessarily the same as the encoding scheme that
maximizes equivocation (privacy) for a given distortion $D$. This is because a compression scheme that only
satisfies a fidelity constraint on $X_{\mathcal{K}_r}$, i.e., source coding without additional privacy constraints, is
oblivious of the resulting leakage of $X_{\mathcal{K}_h}$, whereas a compression scheme which minimizes the leakage of
$X_{\mathcal{K}_h}$ while revealing $X_{\mathcal{K}_r}$ will first reveal that part of $X_{\mathcal{K}_r}$ that is orthogonal to
$X_{\mathcal{K}_h}$ and only reveal $X_{\mathcal{K}_h}$ when the fidelity requirements are high enough to encode it. Thus,
maximal privacy may require additional precision (of the component of $X_{\mathcal{K}_r}$ orthogonal to
$X_{\mathcal{K}_h}$) relative to the fidelity-only case. The additional rate constraint enables us to intuitively
understand the nature of the lossy compression scheme required when privacy needs to be guaranteed.

We now focus on the case in which the user has access to correlated side information. The resulting 
RDE tradeoff theorems generalize the results in [1]; furthermore, we present a new relatively easier 
proof for the achievable equivocation while introducing a class of encoding schemes that we refer to as 
quantize-and-bin coding (see also [20]).

A. Capturing the Effects of Side-Information 

In general, a user can have access to auxiliary information either from prior interactions with the 
database or from a correlated external source. We cast this problem in information-theoretic terms as 
a database encoding problem with side information at the user. Two cases arise in this context: i) the 
database has knowledge of the side information due to prior interactions with the user and is sharing 
a related but differently sanitized view in the current interaction, i.e., an informed encoder; and ii) the 
database does not know the exact side information but has some statistical knowledge, i.e., a statistically
informed encoder. We develop the RDE regions for both cases below.

1) U-P Tradeoffs: Statistically Informed Encoder: We first focus on the case with side information
at the user and knowledge of its statistics at the encoder, i.e., at the database. The following theorem 
quantifies the RDE region, and hence, the utility-privacy tradeoff region for this case. 
Theorem 2: For a target distortion $D$, the set of achievable $(R, E)$ tuples when the database has access
to the statistics of the side information is given as

$$R \ge R_{SI}(D) \equiv I\left(X_{\mathcal{K}_r} X_{\mathcal{K}_h}; U \mid Z\right) \tag{11a}$$

$$E \le E_{SI}(D) \equiv H\left(X_{\mathcal{K}_h} \mid U Z\right) \tag{11b}$$

for some distribution $p(x_{\mathcal{K}_h}, x_{\mathcal{K}_r}, z)\, p(u \mid x_{\mathcal{K}_h}, x_{\mathcal{K}_r})$ such that there exists a
function $\hat{X}_{\mathcal{K}_r} = f(U, Z)$ for which $E\left[d(X_{\mathcal{K}_r}, \hat{X}_{\mathcal{K}_r})\right] \le D$, and
$|\mathcal{U}| = |\mathcal{X}_{\mathcal{K}}| + 1$.

Remark 5: For the case in which only the public variables are used in encoding, i.e.,
$X_{\mathcal{K}_h} - X_{\mathcal{K}_r} - U$,

We prove Theorem 2 in the Appendix. Here, we present a sketch of the achievability proof. The main
idea is to show that a quantize-and-bin encoding scheme achieves the RDE tradeoff.

The intuition behind the quantize-and-bin coding scheme is as follows: the source
$(X^n_{\mathcal{K}_r}, X^n_{\mathcal{K}_h})$ is first quantized to $U^n$ at a rate of $I(X_{\mathcal{K}_r} X_{\mathcal{K}_h}; U)$. For the
uninformed case, the encoder would have simply sent the index for $U^n$ ($= \hat{X}^n_{\mathcal{K}_r}$) to the decoder.
However, since the encoder has statistical knowledge of the decoder's side information, the encoder further
bins $U^n$ to reduce the transmission rate to $I(X_{\mathcal{K}_r} X_{\mathcal{K}_h}; U) - I(Z; U)$, where $I(Z; U)$ is a
measure of the correlation between $Z^n$ and $U^n$. The encoder then transmits this bin index $J$ so that using
$J$ and $Z^n$, the user can losslessly reconstruct $U^n$, and hence, $\hat{X}^n_{\mathcal{K}_r} = f(U^n, Z^n)$ via a
deterministic function $f$ to the desired $D$.

The outer bounds follow along the lines of the Wyner-Ziv converse as well as outer bounds on the
equivocation (see the Appendix). The key result here is the inner bound on the equivocation, i.e., for a
fixed distortion $D$, the quantize-and-bin encoding scheme can guarantee a lower bound on the equivocation
as $H(X_{\mathcal{K}_h} \mid U, Z)$, which primarily relies on the fact that using the bin index $J$ and side
information $Z^n$, the quantized database $U^n$ can be losslessly reconstructed at the user.
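
The rate saving from binning can be checked numerically: whenever $U - X_{\mathcal{K}} - Z$ forms a Markov chain, the quantization rate minus the binning gain, $I(X_{\mathcal{K}}; U) - I(Z; U)$, equals the conditional mutual information $I(X_{\mathcal{K}}; U \mid Z)$. The sketch below verifies this identity on a small discrete example. It is only an illustration: the source pmf, the two conditional pmfs, and the function names are hypothetical, and a single variable `x` stands in for the pair of public and private attributes.

```python
import numpy as np


def entropy_bits(p) -> float:
    """Shannon entropy in bits of a pmf given as an array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def binning_rates(p_x, p_u_given_x, p_z_given_x):
    """Return (I(X;U), I(Z;U), I(X;U|Z)) in bits for the chain U - X - Z."""
    p_x = np.asarray(p_x, dtype=float)
    p_u_given_x = np.asarray(p_u_given_x, dtype=float)   # rows indexed by x
    p_z_given_x = np.asarray(p_z_given_x, dtype=float)   # rows indexed by x
    # joint[u, x, z] = p(x) p(u|x) p(z|x), which enforces the Markov chain.
    joint = np.einsum("x,xu,xz->uxz", p_x, p_u_given_x, p_z_given_x)
    h = entropy_bits
    h_u = h(joint.sum(axis=(1, 2)))
    i_xu = h_u + h(joint.sum(axis=(0, 2))) - h(joint.sum(axis=2))
    i_zu = h_u + h(joint.sum(axis=(0, 1))) - h(joint.sum(axis=1))
    # I(X;U|Z) = H(X,Z) + H(U,Z) - H(U,X,Z) - H(Z)
    i_xu_given_z = (h(joint.sum(axis=0)) + h(joint.sum(axis=1))
                    - h(joint) - h(joint.sum(axis=(0, 1))))
    return i_xu, i_zu, i_xu_given_z


# Hypothetical source, quantizer, and side-information channel.
p_x = [0.5, 0.3, 0.2]
p_u_given_x = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
p_z_given_x = [[0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
i_xu, i_zu, i_xu_given_z = binning_rates(p_x, p_u_given_x, p_z_given_x)
print(i_xu - i_zu, i_xu_given_z)
```

The two printed numbers agree up to floating-point error, which is the single-letter form of the rate reduction described above.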

Uninformed case: Here, we have $Z = 0$ and $U = \hat{X}_{\mathcal{K}_r}$, i.e., the reconstructed and sanitized
databases are the same. Note that in this case, the quantize-and-bin scheme simplifies to a simple quantize
scheme (as required to achieve Proposition 2).

Remark 6: For a desired $D$, minimizing $R_{SI}(D)$ yields the Wyner-Ziv rate-distortion function. However,
we focus here on the tradeoff region, and hence, the set of all $(R, D, E)$ tuples.

2) U-P Tradeoffs: Informed Encoder: We now consider the case in which the encoder also has perfect 
knowledge of the side information. Such a case can arise in practice if the encoder has shared some prior 
information related to the database earlier. The following theorem summarizes the RDE tradeoff region 
for this case. 
Theorem 3: For a target distortion $D$, the set of achievable $(R, E)$ tuples when the encoder has perfect
knowledge of the side information is given as

$$R \ge R_I(D) \equiv I\left(X_{\mathcal{K}_r}, X_{\mathcal{K}_h}; \hat{X}_{\mathcal{K}_r} \mid Z\right) \tag{12a}$$

$$E \le E_I(D) \equiv H\left(X_{\mathcal{K}_h} \mid \hat{X}_{\mathcal{K}_r} Z\right) \tag{12b}$$

for some distribution $p(x_{\mathcal{K}_h}, x_{\mathcal{K}_r}, z)\, p(\hat{x}_{\mathcal{K}_r} \mid x_{\mathcal{K}_h}, x_{\mathcal{K}_r}, z)$ for which
$E\left[d(X_{\mathcal{K}_r}, \hat{X}_{\mathcal{K}_r})\right] \le D$.

Remark 7: For $Z^n = 0$, Theorem 3 simplifies to Proposition 2.

We prove Theorem 3 in the Appendix. The main idea is to show that an informed quantize-and-bin
encoding scheme, for the informed case in which both $(X^n_{\mathcal{K}}, Z^n)$ are available at the encoder, achieves
the RDE tradeoff. The encoder jointly compresses them to a database $\hat{X}^n_{\mathcal{K}_r}$, which it further bins,
and reveals the bin index to the decoder such that the rate of transmission reduces to
$I(X_{\mathcal{K}} Z; \hat{X}_{\mathcal{K}_r}) - I(Z; \hat{X}_{\mathcal{K}_r}) = I(X_{\mathcal{K}}; \hat{X}_{\mathcal{K}_r} \mid Z)$. Using the bin index
and side information $Z^n$, the database $\hat{X}^n_{\mathcal{K}_r}$ can be losslessly reconstructed. The outer bounds
follow from standard results on the conditional rate-distortion converse (see the Appendix). The key result
is the inner bound on the equivocation, i.e., for a fixed $D$, the quantize-and-bin scheme is shown to
guarantee a minimal equivocation of $H(X_{\mathcal{K}_h} \mid \hat{X}_{\mathcal{K}_r}, Z)$ using the fact that from $J$ and
$Z^n$, $\hat{X}^n_{\mathcal{K}_r}$ can be losslessly reconstructed at the user.

VI. Illustration of Results 

In this Section, we apply the utility-privacy framework we have introduced to model two fundamental 
types of databases and illustrate the corresponding optimal coding schemes that achieve the set of 
all utility-privacy tradeoff points. More importantly, we demonstrate how the optimal input to output 
probabilistic mapping (coding scheme) in each case sheds light on practical privacy-preserving techniques. 
We note that for the i.i.d. source model considered, vector quantization (to determine the set of M output 
databases) simplifies to finding the probabilities of mapping the letters of the source to letters of the 
output (database) alphabet as formally shown in the previous Section. 

We model two broad classes of databases: categorical and numerical. Categorical data are typically 
discrete data sets comprising information such as gender, social security numbers and zip codes that 
provide (meaningful) utility only if they are mapped within their own set. On the other hand, without 
loss of generality, numerical data can be assumed to belong to the set of real numbers or integers as 
appropriate. In general, a database will have a mixture of categorical and numerical attributes, but for the 
purpose of illustration, we assume that the database is of one type or the other, i.e., every attribute is of 
the same kind. In both cases, we assume a single utility (distortion) function. We discuss each example
in detail below. 

Recall that the abstract mapping in (2) is a lossy compression of the database. The underlying principle
of optimal lossy compression is that the number of bits required to represent a sample $x$ of $X \sim p_X$ is
proportional to $-\log(p(x))$, so that rarer samples cost more bits, and thus, for a desired $D$, preserving
the events in descending order of $p_X$ requires the least number of bits on average. The intuitive notion
of privacy as being unidentifiable in a crowd is captured in this information-theoretic formulation since
the low probability entries, the outliers, that convey the most information, are the least represented. It
is this fundamental notion that is captured in both examples.

Example 1: Consider a categorical database with $K > 1$ attributes. In general, the $k$-th attribute
takes values in a discrete set $\mathcal{X}_k$ of cardinality $M_k$. For our example, we assume that all attributes
need to be revealed, and therefore, it suffices to view each entry (a row of all $K$ attributes) of the database
as generated from a discrete scalar source $X$ of cardinality $M$, i.e., $X \sim p(x)$, $x \in \{1, 2, \ldots, M\}$.
Taking into account the fact that sanitizing categorical data requires mapping within the same set, for this
arbitrary discrete source model, we assume that the output sample space $\hat{\mathcal{X}} = \mathcal{X}$. Since changing a
sample of the categorical data can significantly change the utility of the data, we account for this via a
utility function that penalizes such changes. We thus model the utility function as a generalized Hamming
distortion which captures this cost model (averaged over all samples of $X$) such that the average
distortion $D$ is given by

$$D = \Pr\{X \ne \hat{X}\}. \tag{13}$$

Focusing on the problem of revealing the entire database $d = X^n$ (an $n$-sequence realization of $X$) as
$\hat{X}^n$, we define the equivocation as

$$\frac{1}{n} H\left(X^n \mid \hat{X}^n\right) \ge E. \tag{14}$$

n 

Thus, the utility-privacy problem is that of finding the set of all $(D, E)$ pairs such that for every choice
of $p(\hat{x} \mid x)$ achieving a desired $D$, the equivocation is bounded as in (14). Applying Proposition 2 (and
also Theorem 3 with $Z^n = 0$), we have that for a target distortion $D$, the set of achievable $(R, E)$ tuples
satisfies

$$R \ge R_U(D) \equiv I(X; \hat{X}); \qquad E \le E_U(D) \equiv H(X \mid \hat{X}) \tag{15}$$

for some distribution $p(x)\, p(\hat{x} \mid x)$ for which $E\left[d(X, \hat{X})\right] \le D$. Note that the rate
$R_U(D) = H(X) - E_U(D)$, and thus, minimizing $R_U(D)$ for a desired $D$ maximizes $E_U(D)$. Thus, while (15)
defines the set of all $(R, D, E)$ tuples, we focus on the $(D, E)$ pairs for which maximal equivocation
(privacy) is achieved.

The problem of minimizing $R_U(D)$ for an arbitrary source with a generalized Hamming distortion
has been studied in [21], which showed that $R(D)$ is achieved by a reverse water-filling solution such that

$$p_{\hat{X}}(\hat{x}) = \frac{\left(p_X(\hat{x}) - \lambda\right)^+}{\sum_{x \in \mathcal{X}} \left(p_X(x) - \lambda\right)^+} \tag{16}$$

with $(a)^+ = \max(a, 0)$, and the 'test channel' (mapping from $\hat{X}$ to $X$) is given by

$$p(x \mid \hat{x}) = \begin{cases} \bar{D}, & x = \hat{x} \\ \lambda, & x \ne \hat{x},\ x \in \mathcal{X}_{supp} \\ p_k, & x = k \notin \mathcal{X}_{supp} \end{cases} \tag{17}$$

where $\bar{D} = 1 - D$, $\lambda$ is chosen such that $\sum_{\hat{x}} p(\hat{x})\, p(x \mid \hat{x}) = p(x)$, $p_k = p(x = k)$, and
$\mathcal{X}_{supp} = \{x : p(x) - \lambda > 0\}$. Let $S = |\mathcal{X}_{supp}| - 1$. The maximal achievable equivocation, and
hence, the largest utility-privacy tradeoff region is

$$\Gamma(D) = -\bar{D} \log \bar{D} - S \lambda \log \lambda - \sum_{k \notin \mathcal{X}_{supp}} p_k \log p_k. \tag{18}$$

The water level $\lambda$ is the Lagrangian for the distortion constraint in minimizing $R_U(D)$. The distribution of
entries in $d'$ in (16) demonstrates that the source samples with low probabilities relative to the water level
are not preserved, leading to a 'flattening' of the output distribution. Thus, we see that the commonly used
heuristics of outlier suppression, aggregation, and imputation [7], [8] on census and related databases can
be formally shown to minimize privacy leakage for the appropriate model. We illustrate our results in
Fig. 4 for $p_X(x)$ = [0.25 0.25 0.15 0.1 0.04 0.005 0.003 0.002] in which the first subplot demonstrates
increased suppression of the outliers with increasing $D$, and the second shows the entire U-P region.
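
The reverse water-filling solution for Hamming distortion can be sketched numerically. The code below is a minimal illustration under stated assumptions and is not the authors' implementation: it assumes a strictly positive source pmf, finds the water level by bisection on the relation $D = \sum_x \min(p(x), \lambda) - \lambda$ (which follows from requiring the backward channel to reproduce the source marginal), and then evaluates the output distribution and the maximal equivocation. The function names and the four-letter pmf in the example are hypothetical.

```python
import numpy as np


def reverse_waterfill(p, D):
    """Return (water level, output pmf, maximal equivocation in bits)."""
    p = np.asarray(p, dtype=float)
    if np.any(p <= 0) or not np.isclose(p.sum(), 1.0):
        raise ValueError("p must be a strictly positive pmf")
    if not 0.0 < D < 1.0 - p.max():
        raise ValueError("D must lie strictly between 0 and 1 - max(p)")

    def distortion(lam):
        # Hamming distortion induced by water level lam (non-decreasing in lam).
        return np.minimum(p, lam).sum() - lam

    lo, hi = 0.0, float(np.sort(p)[-2])   # the level never exceeds the 2nd largest mass
    for _ in range(200):                  # plain bisection
        mid = 0.5 * (lo + hi)
        if distortion(mid) < D:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)

    supp = p > lam                        # letters that survive in the output
    q = np.where(supp, p - lam, 0.0)
    q /= q.sum()                          # output pmf: clipped and renormalized
    S = int(supp.sum()) - 1
    tail = p[~supp]                       # suppressed (outlier) letters
    gamma = (-(1.0 - D) * np.log2(1.0 - D)
             - S * lam * np.log2(lam)
             - float((tail * np.log2(tail)).sum()))
    return lam, q, float(gamma)


def backward_channel(p, D, lam):
    """Matrix W[x, x_hat] = p(x | x_hat) of the backward test channel."""
    p = np.asarray(p, dtype=float)
    supp = p > lam
    W = np.where(supp, lam, p)[:, None] * np.ones((1, p.size))
    idx = np.flatnonzero(supp)
    W[idx, idx] = 1.0 - D
    return W


p_example = np.array([0.4, 0.3, 0.2, 0.1])   # hypothetical source pmf
for D in (0.1, 0.35):
    lam, q, gamma = reverse_waterfill(p_example, D)
    print(D, round(lam, 4), np.round(q, 4), round(gamma, 4))
```

For the larger distortion in this example the least likely letter receives zero output probability, which is the outlier suppression described above; multiplying the backward channel by the output pmf recovers the source pmf.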

Fig. 4. (a) Reverse WF distributions for D = 0.1, 0.25, 0.5; (b) U-P tradeoff region.

Interpretation: The probability $p(x)$ is the assumed probability of occurrence of each unique sample
(e.g., names such as Smith, Johnson, Poor, Sankar, etc.) in the database. For categorical data, the attribute
space for the input and output databases are assumed to be the same (e.g., names mapped to names).
The Hamming distortion measure we have chosen quantifies the average probability of a true sample of
the source being mapped to a different sample in the output database (e.g., probability that a name in the
input database is mapped to a different name in the output database averaged over all names). The output
distribution in (16) implies that for a desired utility (quantified via a Hamming distortion $D$), all the input
samples with probabilities below a certain $\lambda$ (e.g., say 'Sankar,' a very low probability name) will not
be present in the output database. The water level $\lambda$ is chosen such that the input and output database
samples satisfy $D$ in (13). Thus, the probability of guessing that Sankar was in the original database given
that one only sees Smith, Johnson, and Poor is given by (17) and is the same as the probability of Sankar
in the original database, i.e., there is no reduction in uncertainty about Sankar given the published data!
Furthermore, given that the name Smith is published, the probability that Smith resulted from others such
as Johnson, Poor, and Sankar as well as from Smith is also given by (17). This shows that every sample
in the output database contains some uncertainty about the actual sample with maximal uncertainty for
those suppressed. Our mapping not only mathematically minimizes the leakage of the original samples
but also does so to provide privacy to all and maximally to those who are viewed as outliers (relative
to the utility measure). For simplicity, we have chosen a single private attribute, name, in this example.
In general, there could be several correlated attributes (e.g., name and last four digits of the SSN) that
will be changed together. This is captured by our joint distribution. This eliminates the possibility that
the adversary uses his knowledge of the distribution to tell which individual entries have been changed.
The use of the Hamming distortion measure in this example illustrates another aspect of the power of our
model. Sanitization of non-numeric data attributes in a utility-preserving way is hard to do, especially
because distance metrics for non-numeric data tend to be application-specific. Hamming distortion is an
example of an extreme measure that penalizes every change uniformly, no matter how small the change.
It may be appropriate to use this measure for applications that are especially sensitive to utility loss.

Example 2: In this example we model a numerical (e.g., medical) database in which the attributes such
as weight and blood pressure are often assumed to be normally (Gaussian) distributed. Specifically, we
consider a $K = 2$ database with a public $X$ ($\equiv X_r$) and a private $Y$ ($\equiv X_h$) attribute such that $X$
and $Y$ are jointly Gaussian with zero means and variances $\sigma_X^2$ and $\sigma_Y^2$, respectively, and a correlation
coefficient $\rho_{XY} = E[XY] / (\sigma_X \sigma_Y)$. We assume that only $X$ is encoded such that $Y - X - \hat{X}$
holds. We consider three cases: (i) no side information, (ii) side information $Z^n$ at user, and (iii) $Z^n$ at
both. For the cases with $Z^n$, we assume that $Z$ is i.i.d. zero mean with variance $\sigma_Z^2$ and is jointly
Gaussian with $(X, Y)$ such that $Y - X - Z$ forms a Markov chain and has a correlation coefficient
$\rho_{XZ} = E[XZ] / (\sigma_X \sigma_Z)$. We use the leakage $L$ in (6) as the privacy metric.

Case (i): No side information: The $(R, D, L)$ region for this case can be obtained directly from
Proposition 2 in (8) with $X_{\mathcal{K}_r} \equiv X$ and $E_U(D)$ replaced by $L_U(D) \equiv I(Y; \hat{X})$. For a Gaussian
$(X, Y)$, one can easily verify that, for a desired $D$, both $R_U(D)$ and $L_U(D)$ are minimized by a Gaussian
$\hat{X}$ [17, Chap. 10], i.e., for normally distributed databases, the privacy-maximizing revealed database is
also normally distributed. Furthermore, due to $Y - X - \hat{X}$, the minimization of $I(X; \hat{X})$ is strictly
over $p(\hat{x} \mid x)$, and thus, simplifies to the familiar R-D problem for a Gaussian source that is achieved by
choosing $\hat{X} = X + N$, where the noise $N \sim \mathcal{N}(0, \sigma_N^2)$ is independent of $X$ and its variance
$\sigma_N^2$ is chosen such that $D = E\left[\mathrm{var}(X \mid \hat{X})\right] \in [0, \sigma_X^2]$, where var denotes variance.
The resulting minimal rate and leakage achieved (in bits per entry) are, for $D \in [0, \sigma_X^2]$,

$$R_U^*(D) = \frac{1}{2} \log\left(\frac{\sigma_X^2}{D}\right), \qquad L_U^*(D) = \frac{1}{2} \log\left(\frac{\sigma_X^2}{(1 - \rho_{XY}^2)\,\sigma_X^2 + \rho_{XY}^2 D}\right).$$

The largest U-P tradeoff region is thus the region enclosed by $L_U^*(D)$.
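
For the jointly Gaussian model of this case, the rate and leakage can be evaluated in closed form. The sketch below is an illustration under stated assumptions rather than the paper's own computation: it uses the standard Gaussian rate-distortion function and the leakage expression that follows from the chain $Y - X - \hat{X}$ with additive Gaussian noise, and it cross-checks the leakage against a direct computation from second-order statistics. Function names and parameter values are hypothetical.

```python
import math


def rate_uninformed(D: float, var_x: float) -> float:
    """Gaussian rate-distortion function in bits per entry, 0 < D <= var_x."""
    return 0.5 * math.log2(var_x / D)


def leakage_uninformed(D: float, var_x: float, rho_xy: float) -> float:
    """Leakage I(Y; X_hat) in bits for the chain Y - X - X_hat at distortion D."""
    return 0.5 * math.log2(var_x / ((1.0 - rho_xy ** 2) * var_x + rho_xy ** 2 * D))


def leakage_from_noise(var_n: float, var_x: float, var_y: float, rho_xy: float):
    """Distortion D = var(X | X + N) and leakage I(Y; X + N) from covariances."""
    var_xhat = var_x + var_n                         # variance of X_hat = X + N
    cov_y_xhat = rho_xy * math.sqrt(var_x * var_y)   # N is independent of (X, Y)
    rho_sq = cov_y_xhat ** 2 / (var_y * var_xhat)    # squared correlation of (Y, X_hat)
    D = var_x * var_n / var_xhat                     # conditional variance of X given X_hat
    return D, -0.5 * math.log2(1.0 - rho_sq)


# Hypothetical parameters: unit-variance X, correlation 0.8 with the private Y.
D, leak = leakage_from_noise(var_n=0.5, var_x=1.0, var_y=2.0, rho_xy=0.8)
print(D, leak, leakage_uninformed(D, 1.0, 0.8), rate_uninformed(D, 1.0))
```

The leakage vanishes as $D$ approaches $\sigma_X^2$ and approaches $I(X; Y)$ as $D$ approaches zero, which is the tradeoff curve for this case.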

Case (ii): For the statistically informed encoder, the $(R, D, L)$ region is given by (11) with $E_{SI}(D)$
replaced by $L_{SI}(D) \equiv I(Y; UZ)$. One can show the optimality of Gaussian encoding in minimizing
both the rate and leakage in (11), and thus, we have $U = X + N$, where $N \sim \mathcal{N}(0, \sigma_N^2)$ is independent of
$X$ and its variance $\sigma_N^2$ is chosen such that the distortion $D = E\left[\mathrm{var}(X \mid UZ)\right] \in [0, \sigma_X^2]$.
Computing the minimal rate $R_{SI}^*(D)$ (the Wyner-Ziv rate [22]) and leakage $L_{SI}^*(D)$ for a jointly Gaussian
distribution achieving a distortion $D$, we obtain for all $D \in [0, \sigma_X^2 (1 - \rho_{XZ}^2)]$,

$$R_{SI}^*(D) = R_{WZ}(D) = \frac{1}{2} \log\left(\frac{\sigma_X^2 (1 - \rho_{XZ}^2)}{D}\right)$$

$$L_{SI}^*(D) = L_U^*(D),$$

i.e., the minimal rate and leakage are independent of $\rho_{XY}$ and $\rho_{XZ}^2$, respectively, and thus, user side
information does not degrade privacy when the minimal-rate encoding is used. The access to side
information at the user implies that the maximal achievable distortion is at most as large as the uninformed
case. Note that unlike $L_U^*(D)$, which goes to zero at the maximal distortion of $\sigma_X^2$, $L_{SI}^*(D) > 0$ for
$D = \sigma_X^2 (1 - \rho_{XZ}^2)$ as a result of the implicit correlation between $Y$ and $Z$. These observations are
clearly shown in Fig. 5 for $\sigma_X^2 = 1$ and different values of $\rho_{XY}$ and $\rho_{XZ}^2$.

Fig. 5. Plot of Rate and Leakage vs. D for Cases (i), (ii), and (iii).
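
The side-information case can be checked directly from covariances. The sketch below is a hedged illustration, not the paper's derivation: it builds the covariance matrix of $(Y, X, Z, U)$ for $U = X + N$ under the assumed Markov chain $Y - X - Z$, computes conditional variances by Schur complements, and compares the resulting rate $I(X; U \mid Z)$ and leakage $I(Y; U, Z)$ with the standard Gaussian closed forms. All names and parameter values are hypothetical.

```python
import numpy as np


def gaussian_side_info_point(var_n, var_x=1.0, var_y=1.0, var_z=1.0,
                             rho_xy=0.8, rho_xz=0.5):
    """Return (D, rate, leakage) in bits for U = X + N with side information Z."""
    sx, sy, sz = np.sqrt([var_x, var_y, var_z])
    c_xy = rho_xy * sx * sy
    c_xz = rho_xz * sx * sz
    c_yz = rho_xy * rho_xz * sy * sz      # implied by the chain Y - X - Z
    # Covariance of (Y, X, Z, U); N is independent of (Y, X, Z).
    cov = np.array([
        [var_y, c_xy,  c_yz,  c_xy],
        [c_xy,  var_x, c_xz,  var_x],
        [c_yz,  c_xz,  var_z, c_xz],
        [c_xy,  var_x, c_xz,  var_x + var_n],
    ])

    def cond_var(target, given):
        # Schur complement: variance of `target` given the variables in `given`.
        g = cov[np.ix_(given, given)]
        c = cov[target, given]
        return float(cov[target, target] - c @ np.linalg.solve(g, c))

    Y, X, Z, U = range(4)
    D = cond_var(X, [U, Z])
    rate = 0.5 * np.log2(cond_var(X, [Z]) / D)            # I(X; U | Z)
    leak = 0.5 * np.log2(var_y / cond_var(Y, [U, Z]))     # I(Y; U, Z)
    return D, float(rate), float(leak)


def closed_forms(D, var_x=1.0, rho_xy=0.8, rho_xz=0.5):
    """Wyner-Ziv rate and leakage as functions of the achieved distortion D."""
    rate = 0.5 * np.log2(var_x * (1.0 - rho_xz ** 2) / D)
    leak = 0.5 * np.log2(var_x / ((1.0 - rho_xy ** 2) * var_x + rho_xy ** 2 * D))
    return float(rate), float(leak)


D, rate, leak = gaussian_side_info_point(var_n=0.5)
print(D, rate, leak, closed_forms(D))
```

Letting the noise variance grow drives $D$ toward $\sigma_X^2 (1 - \rho_{XZ}^2)$ while the leakage stays strictly positive, reflecting what the user already learns about $Y$ from $Z$ alone.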

Case (iii): Finally, for a Gaussian source model, the $(R, D, L)$ region achievable for the informed
encoder-decoder pair is the same as that for Case (ii). This is because of the no rate-loss property of
Wyner-Ziv coding for a Gaussian source, i.e., knowledge of the side information statistics at the encoder
suffices to remove the correlation from each entry before sharing data with the user [23]. Furthermore,
since Gaussian outputs minimize the rate as well as the leakage, the minimal $R_I^*(D) = R_{SI}^*(D)$ and
$L_I^*(D) = L_{SI}^*(D)$ (see Fig. 5).

Interpretation: The RDL and U-P tradeoffs for the Gaussian models considered here reveal that the
privacy-maximal code requires that the reconstructed database is also Gaussian distributed. This in turn is a
direct result of the following fact: a Gaussian distribution has the maximal (conditional and unconditional)
entropy (uncertainty) for a fixed variance [17, Chap 8, Th. 8.6.5] (and hence, a fixed mean-squared
distortion between the input and output databases). Thus, if one wishes to preserve the most uncertainty
about the original input database from the output, the output must also be Gaussian distributed, i.e., it
suffices to add Gaussian noise, since the sum of two Gaussians is a Gaussian. The power of our model
and the results is that not only can one find the privacy-optimal noise perturbation for the Gaussian
case but that practical applications such as medical analytics that assume Gaussian-distributed data can
still work on sanitized data, albeit with modified parameter values.

In [18], it was noted that Gaussian noise is often the easiest to filter and this observation may seem 
to be in conflict with our result - if the added noise can be filtered out, the privacy protection afforded 
by the added noise can be reduced by the adversary. However, what [18] actually shows is that when 
the spectra of the noise and the data differ significantly the noise can be filtered, thereby jeopardizing 
privacy measures. For the i.i.d. source model (i.e., a source with no memory) considered here, the i.i.d. 
Gaussian noise that is added to guarantee privacy has the same flat power spectral density as the source, 
and thus, the perturbed data cannot be distinguished from the added noise. In fact, the quantization that 
underlies the information-theoretic sanitization mechanism developed here is an irreversible process and 
one cannot obtain the original data except for $D = 0$ (i.e., the case of no sanitization). As a point of
comparison, we note that in a separate work on privacy of streaming data (non-i.i.d time-series data 
modeled as a colored Gaussian process, i.e. data that has non-flat spectrum), we have shown that the 
privacy-optimal noise perturbation requires the spectrum of the added noise to be non-flat to match that 
of the non-i.i.d. data [2]. 

Our example also reveals how finding the optimal sanitization mechanism, i.e., the optimal mapping
from the original public to the revealed attributes, depends on the statistical model. In fact, it is for
this reason that adding Gaussian noise for any numerical database will not, in general, be optimal unless
the database statistics can be approximated by a Gaussian distribution.

VII. Concluding Remarks 

The ability to achieve the desired level of privacy while guaranteeing a minimal level of utility and 
vice-versa for a general data source is paramount. Our work defines privacy and utility as fundamental 
characteristics of data sources that may be in conflict and can be traded off. This is one of the earliest 
attempts at systematically applying information theoretic techniques to this problem. Using rate-distortion 
theory, we have developed a U-P tradeoff region for i.i.d. data sources with known distribution. 

We have presented a theoretical treatment of a universal (i.e. not dependent on specific data features 
or adversarial assumptions) theory for privacy and utility that addresses both numeric and categorical 
(non-numeric) data. We have proposed a novel notion of privacy based on guarding existing uncertainty 
about hidden data that is intuitive but also supported by rigorous theory. Prior to our work there was no 
comparable model that applied to both data types, so no side-by-side comparisons can be made across the 
board between different approaches. The examples developed here are the first step towards understanding 
practical approaches with precise guarantees. The next step would be to pick specific sample domains
(e.g., medical data, census data), devise the appropriate statistical distributions and U-P metrics, set 
desirable levels of privacy and utility parameters, and then analyze on test data. These topics for future 
research however require the theoretical framework proposed here as a crucial first step. 

Several challenges remain in quantifying utility-privacy tradeoffs for more general sources. For example, 
our model needs to be generalized for non-i.i.d. data sources, sources with unknown distributions, and 
sources lacking strong structural properties (such as Web searches). Results from rate-distortion theory 
for sources with memory and universal lossy compression may help address these challenges. Farther
afield, our privacy guarantee is an average metric based on Shannon entropy which may be inadequate 
for some applications where strong anonymity guarantees are required for every individual in a database 
(such as an HIV database). Finally, we have recently extended this framework to privacy applications 
with time-series sources [2] and organizational data disclosure [24]. 



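Appendix

A. Proofs of Theorems 2 and 3

1) Statistically Informed Case: Proof of Theorem 2 - Converse: We now formally develop lower and
upper bounds on the rate and equivocation, respectively, that are achievable for the statistically informed
encoder case. We show that given an $(n, 2^{n(R+\epsilon)}, D+\epsilon, E-\epsilon)$ code there exists a
$p(x_{\mathcal{K}_r}, x_{\mathcal{K}_h}, z)\, p(u \mid x_{\mathcal{K}_r}, x_{\mathcal{K}_h})$ such that the rate and equivocation of the system are
bounded as follows:

$$\begin{aligned}
R + \epsilon &\ge \frac{1}{n} \log M \ge \frac{1}{n} H(J) \ge \frac{1}{n} I\left(J; X^n_{\mathcal{K}} \mid Z^n\right) \\
&= \frac{1}{n} \left\{ H\left(X^n_{\mathcal{K}} \mid Z^n\right) - H\left(X^n_{\mathcal{K}} \mid J Z^n\right) \right\} && (19) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left\{ H\left(X_{\mathcal{K},i} \mid Z_i\right) - H\left(X_{\mathcal{K},i} \mid X^{i-1}_{\mathcal{K}} J Z^n\right) \right\} && (20) \\
&\ge \frac{1}{n} \sum_{i=1}^{n} H\left(X_{\mathcal{K},i} \mid Z_i\right) - \frac{1}{n} \sum_{i=1}^{n} H\left(X_{\mathcal{K},i} \mid Z_i U_i\right) && (21) \\
&= \frac{1}{n} \sum_{i=1}^{n} R_{SI}(D_i) && (22) \\
&\ge R_{SI}(D) && (23)
\end{aligned}$$

where $X^{i-1}_{\mathcal{K}} = [X_{\mathcal{K},1}\ X_{\mathcal{K},2}\ \cdots\ X_{\mathcal{K},i-1}]$, (20) follows from the assumption of an i.i.d.
source, (21) from the fact that conditioning does not increase entropy and by setting
$U_i \equiv (J, Z^{i-1}, Z^n_{i+1})$ such that $U_i - X_{\mathcal{K},i} - Z_i$ forms a Markov chain for all $i$, and
$\hat{X}_{\mathcal{K}_r,i} = g_i(J, Z^n) = f_i(U_i, Z_i)$ for some $g_i$ and $f_i$, (22) from definition (11a) for

$$D_i \equiv E\left[d\left(X_{\mathcal{K}_r,i}, \hat{X}_{\mathcal{K}_r,i}\right)\right], \quad \text{and} \quad E_{SI,i} \equiv H\left(X_{\mathcal{K}_h,i} \mid U_i Z_i\right),$$

and (23) from the convexity of the function $R_{SI}(D)$ defined in (11a) (see [17, Chap. 10], [22]).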

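For the same $(n, 2^{n(R+\epsilon)}, D+\epsilon, E-\epsilon)$ code considered, we can upper bound the achievable equivocation 
as 

\begin{align}
E - \epsilon &\le \frac{1}{n}H(X_{\mathcal{K}_h}^n | J Z^n) \notag\\
&= \frac{1}{n}\sum_{i=1}^{n} H(X_{\mathcal{K}_h,i} | X_{\mathcal{K}_h}^{i-1} Z_i (J Z^{i-1} Z_{i+1}^n)) \notag\\
&\le \frac{1}{n}\sum_{i=1}^{n} H(X_{\mathcal{K}_h,i} | Z_i U_i) \tag{24}\\
&= \frac{1}{n}\sum_{i=1}^{n} E_{SI}(D_i) \tag{25}\\
&\le E_{SI}(D) \tag{26}
\end{align}

where (25) follows from (11b) and (26) follows from the concavity of the equivocation (logarithm) 
function $E_{SI}$. 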
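The final steps (23) and (26) are Jensen-type arguments: an average of a convex function dominates the function at the average, and the reverse holds for a concave function. The sketch below illustrates this numerically. It does not use the paper's $R_{SI}$ or $E_{SI}$; as stand-ins it assumes the textbook rate-distortion function $R(D) = 1 - h(D)$ of a Bernoulli(1/2) source under Hamming distortion (convex) and the binary entropy $h(D)$ (concave). The per-letter distortions are made-up values.

```python
import math


def h2(d: float) -> float:
    """Binary entropy in bits; concave on [0, 1]."""
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -d * math.log2(d) - (1.0 - d) * math.log2(1.0 - d)


def rate(d: float) -> float:
    """Stand-in convex rate function: R(D) = 1 - h(D), valid for 0 <= D <= 1/2."""
    return 1.0 - h2(d)


# Hypothetical per-letter distortions D_i (illustrative values only).
per_letter = [0.05, 0.10, 0.20, 0.25]
n = len(per_letter)
d_avg = sum(per_letter) / n

# Convex case, as in step (23): (1/n) sum R(D_i) >= R((1/n) sum D_i).
avg_rate = sum(rate(d) for d in per_letter) / n
rate_at_avg = rate(d_avg)

# Concave case, as in step (26): (1/n) sum E(D_i) <= E((1/n) sum D_i).
avg_equiv = sum(h2(d) for d in per_letter) / n
equiv_at_avg = h2(d_avg)

print(avg_rate >= rate_at_avg, avg_equiv <= equiv_at_avg)
```

Both comparisons hold for any choice of per-letter distortions in $[0, 1/2]$, which is what allows the per-letter bounds to be collapsed into a single-letter expression.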

Remark 8: If the private variables $X_{\mathcal{K}_h}$ are not directly used in encoding, i.e., $X_{\mathcal{K}_h}^n - X_{\mathcal{K}_r}^n - U^n$ form 
a Markov chain, then from the i.i.d. assumption of the source and the resulting encoding, the Markov 
chain $X_{\mathcal{K}_h,i} - X_{\mathcal{K}_r,i} - U_i$ holds for all $i = 1, 2, \ldots, n$. 

Achievability: We briefly summarize the quantize-and-bin coding scheme for the statistically informed 
encoder case. Consider an input distribution $p(u, x_{\mathcal{K}}, z)$: 

$$p(u, x_{\mathcal{K}}, z) = p(u, x_{\mathcal{K}})\,p(z|x_{\mathcal{K}}),$$

i.e., $U - X_{\mathcal{K}} - Z$ forms a Markov chain. Fix $p(u|x_{\mathcal{K}})$. First generate $M = 2^{n(I(U;X_{\mathcal{K}})+\epsilon)}$ 
databases $U^n(w)$, $w = 1, 2, \ldots, M$, i.i.d. according to $p(u)$. Let $W$ denote the random variable for the index 
$w$. Next, for ease of notation, denote the following: 

$$S = 2^{nI(X_{\mathcal{K}};U)}, \qquad R = 2^{nI(X_{\mathcal{K}};U|Z)}, \qquad T = 2^{nI(U;Z)}.$$
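The three sizes are linked: under the Markov chain $U - X_{\mathcal{K}} - Z$, the identity $I(X_{\mathcal{K}};U) = I(X_{\mathcal{K}};U|Z) + I(U;Z)$ holds, so $S = R \cdot T$ and the $S$ codewords split into $R$ bins of $T$ codewords each. The following sketch checks the identity numerically on a toy binary distribution. The probability tables are arbitrary illustrative values, not taken from the paper, and the helper names are hypothetical.

```python
import math


def entropy(pmf: dict) -> float:
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def marginal(joint: dict, idx: tuple) -> dict:
    """Marginalize a joint pmf over tuples onto the coordinates in idx."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in idx)
        out[sub] = out.get(sub, 0.0) + p
    return out


def mutual_info(joint: dict, a: tuple, b: tuple, c: tuple = ()) -> float:
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    def h(idx):
        return entropy(marginal(joint, idx))
    return h(a + c) + h(b + c) - h(a + b + c) - h(c)


# Toy distribution with U - X - Z Markov: p(u, x, z) = p(x) p(u|x) p(z|x).
p_x = {0: 0.6, 1: 0.4}
p_u_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_z_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.25, 1: 0.75}}

joint = {
    (u, x, z): p_x[x] * p_u_given_x[x][u] * p_z_given_x[x][z]
    for u in (0, 1) for x in (0, 1) for z in (0, 1)
}

U, X, Z = (0,), (1,), (2,)  # coordinate positions inside each key
i_ux = mutual_info(joint, U, X)       # exponent of S (per symbol)
i_xu_z = mutual_info(joint, X, U, Z)  # exponent of R (per symbol)
i_uz = mutual_info(joint, U, Z)       # exponent of T (per symbol)

print(abs(i_ux - (i_xu_z + i_uz)) < 1e-9)
```

The gap $I(U;Z)$ is the rate saved by binning: the decoder recovers it from its side information $Z^n$.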

The encoder bins the $u^n(w)$ sequences into $R$ bins as follows: 

$$J(u^n(w)) = k, \quad \text{if } w \in [(k-1)T + 1,\, kT].$$

Upon observing a source sequence $x_{\mathcal{K}}^n$, the encoder searches for a $u^n(w)$ sequence such that $(x_{\mathcal{K}}^n, u^n(w)) \in 
\mathcal{T}_{X_{\mathcal{K}} U}(n, \epsilon)$ (the choice of $M$ ensures that there exists at least one such $w$). The encoder sends $J(w)$, 
the bin index of the chosen $u^n(w)$ sequence, at a rate $R = I(X_{\mathcal{K}}; U|Z) + \epsilon$. 
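The binning rule assigns consecutive blocks of $T$ codeword indices to the same bin. A minimal sketch of that index map, with toy sizes chosen only for illustration (the function name `bin_index` is hypothetical):

```python
def bin_index(w: int, T: int) -> int:
    """Return the bin k such that w lies in [(k-1)*T + 1, k*T].

    Both the codeword index w and the bin index k are 1-based,
    matching the rule J(u^n(w)) = k.
    """
    if w < 1 or T < 1:
        raise ValueError("w and T must be positive integers")
    return (w + T - 1) // T  # integer ceiling of w / T


# Toy sizes (illustrative only): 12 codewords, 3 per bin, hence 4 bins.
M, T = 12, 3
bins = [bin_index(w, T) for w in range(1, M + 1)]
print(bins)
```

Only the bin index is transmitted, so the decoder must pick the right codeword among the $T$ candidates in that bin using its side information.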

This encoding scheme implies the decodability of the $U^n$ sequence as follows: upon receiving the bin 
index $J(u^n(w)) = j$, the uncertainty at the decoder about $u^n(w)$ is reduced. In particular, having the bin 
index $j$, it knows that there are only $2^{nI(U;Z)}$ possible $u^n$ sequences that could have resulted in the bin 
index $j$. It then uses joint typical decoding with $Z^n$ to decode the correct $u^n$ sequence (the probability 
of decoding error goes to zero as $n \to \infty$ by standard arguments as in the channel coding theorem). By 
Fano's inequality, this implies that the decoder having access to $(J, Z^n)$ can correctly decode $W$, and hence 
$U^n(W)$, with high probability, i.e., 

$$\frac{1}{n}H(W | J, Z^n) = \frac{1}{n}H(U^n(W) | J, Z^n) \le \delta(n) \tag{27}$$

where $\delta(n) \to 0$ as $n \to \infty$. 

2) Proof of Equivocation: For the quantize-and-bin scheme presented above, we will show that 

$$\lim_{n \to \infty} \frac{1}{n}H(X_{\mathcal{K}_h}^n | J, Z^n) \ge H(X_{\mathcal{K}_h} | U, Z) - \epsilon,$$

which is equivalent to showing that 

$$\lim_{n \to \infty} \frac{1}{n}I(X_{\mathcal{K}_h}^n; J, Z^n) \le I(X_{\mathcal{K}_h}; U, Z) + \epsilon.$$

Our proof is based on the fact that for the chosen quantize-and-bin coding scheme, at the decoder 
given the bin index and side information, the uncertainty of the quantized sequences $U^n$ approaches zero 
for large $n$ as shown in (27). 

Consider the term $I(X_{\mathcal{K}_h}^n; J, U^n, Z^n)$ which can be written as 

\begin{align}
I(X_{\mathcal{K}_h}^n; J, U^n, Z^n) &= I(X_{\mathcal{K}_h}^n; J, Z^n) + I(X_{\mathcal{K}_h}^n; U^n | J, Z^n) \tag{28a}\\
&\;\;\cdots \tag{28b}\\
&= I(X_{\mathcal{K}_h}^n; U^n, Z^n) + I(X_{\mathcal{K}_h}^n; J | U^n, Z^n) \tag{28c}\\
&\le I(X_{\mathcal{K}_h}^n; U^n, Z^n) \tag{28d}\\
&\;\;\cdots \tag{28e}\\
&\le n\left(I(X_{\mathcal{K}_h}; U, Z) + \delta(n)\right) \tag{28f}\\
&\le n\left(I(X_{\mathcal{K}_h}; U, Z) + \epsilon\right) \tag{28g}
\end{align}

where (28b) follows from (27), (28c) follows from (27) and the fact that mutual information is 
non-negative, (28d) follows from the fact that there is no uncertainty in the bin index $J(W)$ given $U^n(W)$, 
(28e) follows from the i.i.d. assumption on the source and side information statistics, (28f) is proved in Appendix B 
below such that $\delta(n) \to 0$ as $n \to \infty$, and finally (28g) follows from choosing $\epsilon > \delta(n)$, where $\epsilon$ determines 
the size $M = 2^{n(R+\epsilon)}$ of the codebook and can be made arbitrarily small as $n \to \infty$. 

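3) Informed Encoder Case: Proof of Theorem 3. Converse: We now formally develop lower and upper 
bounds on the rate and equivocation, respectively, that are achievable for the informed encoder case. The 
converse for the rate mirrors the standard converse and we clarify the steps briefly. We show that given an 
$(n, 2^{n(R+\epsilon)}, D+\epsilon, E-\epsilon)$ code there exists a $p(x_{\mathcal{K}_r}, x_{\mathcal{K}_h}, z)\,p(\hat{x}_{\mathcal{K}_r} | x_{\mathcal{K}_r}, x_{\mathcal{K}_h}, z)$ such that the rate and 
equivocation of the system are bounded as follows: 

\begin{align}
R + \epsilon &\ge \frac{1}{n}H(J) \ge \frac{1}{n}I(J; X_{\mathcal{K}}^n, Z^n) \ge \frac{1}{n}I(X_{\mathcal{K}}^n; J | Z^n) \notag\\
&\ge \frac{1}{n}\sum_{i=1}^{n} H(X_{\mathcal{K},i} | Z_i) - \frac{1}{n}\sum_{i=1}^{n} H\left(X_{\mathcal{K},i} | J Z^n X_{\mathcal{K}}^{i-1}\right) \notag\\
&\ge \frac{1}{n}\sum_{i=1}^{n} H(X_{\mathcal{K},i} | Z_i) - \frac{1}{n}\sum_{i=1}^{n} H\left(X_{\mathcal{K},i} | Z_i \hat{X}_{\mathcal{K}_r,i}\right) \notag\\
&= \frac{1}{n}\sum_{i=1}^{n} R_{I}(D_i) \tag{29}\\
&\ge R_{I}(D) \tag{30}
\end{align}

where (30) follows from the convexity of the function $R_I(D)$ defined in (11a) [17, Chap. 10] for 

\begin{align}
D_i &= E\left[d(X_{\mathcal{K}_r,i}, \hat{X}_{\mathcal{K}_r,i})\right] \quad \text{and} \tag{31a}\\
E_{I,i} &= H(X_{\mathcal{K}_h,i} | \hat{X}_{\mathcal{K}_r,i} Z_i). \tag{31b}
\end{align}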
For the same $(n, 2^{n(R+\epsilon)}, D+\epsilon, E-\epsilon)$ code considered, we can upper bound the achievable equivocation 
as 

\begin{align}
E - \epsilon &\le \frac{1}{n}H(X_{\mathcal{K}_h}^n | J Z^n) \notag\\
&= \frac{1}{n}\sum_{i=1}^{n} H\left(X_{\mathcal{K}_h,i} | X_{\mathcal{K}_h}^{i-1} Z^n J \hat{X}_{\mathcal{K}_r}^n\right) \tag{32}\\
&\le \frac{1}{n}\sum_{i=1}^{n} H\left(X_{\mathcal{K}_h,i} | Z_i \hat{X}_{\mathcal{K}_r,i}\right) \tag{33}\\
&= \frac{1}{n}\sum_{i=1}^{n} E_{I}(D_i) \tag{34}\\
&\le E_{I}(D) \tag{35}
\end{align}

where (32) follows from the fact that the reconstructed database $\hat{X}_{\mathcal{K}_r}^n$ is a function of $J$ and $Z^n$, 
(33) follows from the fact that conditioning does not increase entropy, (34) follows from (31b), and (35) 
follows from the concavity of the equivocation (logarithm) function $E_I$. 

Remark 9: If the hidden variables $X_{\mathcal{K}_h}$ are not directly used in encoding, i.e., $X_{\mathcal{K}_h}^n - X_{\mathcal{K}_r}^n - \hat{X}_{\mathcal{K}_r}^n$ form 
a Markov chain, then from the i.i.d. assumption of the source and the resulting encoding, the Markov 
chain $X_{\mathcal{K}_h,i} - X_{\mathcal{K}_r,i} - \hat{X}_{\mathcal{K}_r,i}$ holds for all $i = 1, 2, \ldots, n$. 

Achievability: We briefly summarize the quantize-and-bin coding scheme for the informed encoder 
case. The encoding mirrors that for the statistically informed case and in the interest of space only the 
differences are highlighted below. The primary difference is that the database encoder now encodes both 
$(X_{\mathcal{K}}, Z)$ such that the input distribution $p(x_{\mathcal{K}}, \hat{x}_{\mathcal{K}_r}, z)$ is 

$$p(x_{\mathcal{K}}, \hat{x}_{\mathcal{K}_r}, z) = p(x_{\mathcal{K}}, z)\,p(\hat{x}_{\mathcal{K}_r} | x_{\mathcal{K}}, z),$$

i.e., $\hat{X}_{\mathcal{K}_r}$ is a function of both $X_{\mathcal{K}}$ and $Z$. This distribution is now used to generate $M = 2^{n(I(X_{\mathcal{K}}; \hat{X}_{\mathcal{K}_r} Z)+\epsilon)}$ 
$\hat{X}_{\mathcal{K}_r}^n(w)$ sequences as before, which are first quantized and then binned at a rate $R = 2^{nI(X_{\mathcal{K}}; \hat{X}_{\mathcal{K}_r} | Z)}$. 
Decoding follows analogously to the previous case, i.e., the decoder uses $Z^n$ and the bin index $J$ to 
decode the correct sequence (the probability of decoding error goes to zero as $n \to \infty$ by standard 
arguments as in the channel coding theorem). By Fano's inequality, this implies that the decoder 
having access to $(J, Z^n)$ can correctly decode $W$, and hence $\hat{X}_{\mathcal{K}_r}^n(W)$, with high probability, i.e., 

$$\frac{1}{n}H(W | J, Z^n) = \frac{1}{n}H(\hat{X}_{\mathcal{K}_r}^n(W) | J, Z^n) \le \epsilon(n), \tag{36}$$

where $\epsilon(n) \to 0$ as $n \to \infty$. 

Proof of equivocation: For the quantize-and-bin scheme presented above, we need to show that 

$$\lim_{n \to \infty} \frac{1}{n}H(X_{\mathcal{K}_h}^n | J, Z^n) \ge H(X_{\mathcal{K}_h} | \hat{X}_{\mathcal{K}_r}, Z) - \epsilon.$$

Our proof is based on the fact that for the chosen quantize-and-bin coding scheme, at the decoder given 
the bin index $J$ and side information $Z^n$, the uncertainty of the quantized sequences $\hat{X}_{\mathcal{K}_r}^n$ approaches 
zero for large $n$ as shown in (36). The proof is the same as (28) with $U = \hat{X}_{\mathcal{K}_r}$ along with (36) and is 
omitted for brevity. 



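B. Proof of (28f) 

Here, we prove the following inequality: 

$$H(X_{\mathcal{K}_h}^n | U^n, Z^n) \le n\left(H(X_{\mathcal{K}_h} | U, Z) + \epsilon(n)\right).$$

For ease of exposition, let $Y^n = X_{\mathcal{K}_h}^n$ such that $H(X_{\mathcal{K}_h}^n | U^n, Z^n) = H(Y^n | U^n, Z^n)$ can be expanded 
and bounded as 

\begin{align}
H(Y^n | U^n, Z^n) &= \sum_{(\mathbf{u},\mathbf{z})} p(\mathbf{u},\mathbf{z})\,H(Y^n | U^n = \mathbf{u}, Z^n = \mathbf{z}) \notag\\
&= \sum_{(\mathbf{u},\mathbf{z}) \in \mathcal{T}_{UZ}} p(\mathbf{u},\mathbf{z})\,H(Y^n | U^n = \mathbf{u}, Z^n = \mathbf{z}) + \sum_{(\mathbf{u},\mathbf{z}) \notin \mathcal{T}_{UZ}} p(\mathbf{u},\mathbf{z})\,H(Y^n | U^n = \mathbf{u}, Z^n = \mathbf{z}) \notag\\
&\le \sum_{(\mathbf{u},\mathbf{z}) \in \mathcal{T}_{UZ}} p(\mathbf{u},\mathbf{z})\,H(Y^n | U^n = \mathbf{u}, Z^n = \mathbf{z}) + \sum_{(\mathbf{u},\mathbf{z}) \notin \mathcal{T}_{UZ}} p(\mathbf{u},\mathbf{z})\,H(Y^n) \notag\\
&\le \sum_{(\mathbf{u},\mathbf{z}) \in \mathcal{T}_{UZ}} p(\mathbf{u},\mathbf{z})\,H(Y^n | U^n = \mathbf{u}, Z^n = \mathbf{z}) + nH(Y)\delta(n) \notag\\
&= \sum_{(\mathbf{u},\mathbf{z}) \in \mathcal{T}_{UZ}} p(\mathbf{u},\mathbf{z}) \Big[ -\sum_{\mathbf{y}} p(\mathbf{y}|\mathbf{u},\mathbf{z}) \log p(\mathbf{y}|\mathbf{u},\mathbf{z}) \Big] + nH(Y)\delta(n) \notag\\
&= \sum_{(\mathbf{u},\mathbf{z}) \in \mathcal{T}_{UZ}} p(\mathbf{u},\mathbf{z}) \Big[ -\sum_{\mathbf{y} \in \mathcal{T}_{Y|\mathbf{u},\mathbf{z}}} p(\mathbf{y}|\mathbf{u},\mathbf{z}) \log p(\mathbf{y}|\mathbf{u},\mathbf{z}) - \sum_{\mathbf{y} \notin \mathcal{T}_{Y|\mathbf{u},\mathbf{z}}} p(\mathbf{y}|\mathbf{u},\mathbf{z}) \log p(\mathbf{y}|\mathbf{u},\mathbf{z}) \Big] + nH(Y)\delta(n) \notag\\
&\le \sum_{(\mathbf{u},\mathbf{z}) \in \mathcal{T}_{UZ}} p(\mathbf{u},\mathbf{z}) \Big[ -\sum_{\mathbf{y} \in \mathcal{T}_{Y|\mathbf{u},\mathbf{z}}} p(\mathbf{y}|\mathbf{u},\mathbf{z}) \log p(\mathbf{y}|\mathbf{u},\mathbf{z}) \Big] + nH(Y)\delta(n) + \epsilon(n) \notag\\
&\le n\left(H(Y | U, Z) + 2\epsilon(n) + H(Y)\delta(n)\right) \notag\\
&= n\left(H(Y | U, Z) + \zeta(n)\right), \notag
\end{align}

where $\zeta(n) \to 0$ as $n \to \infty$. 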
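The bound relies on the probability of the atypical set, $\delta(n)$, vanishing as $n$ grows. The sketch below illustrates this in a deliberately simplified setting: a single i.i.d. Bernoulli($p$) sequence $Y^n$ with weak (entropy) typicality, rather than the conditional joint typicality used in the proof. The parameters $p = 0.3$ and tolerance $0.1$ are arbitrary assumptions.

```python
import math


def atypical_prob(n: int, p: float, eps: float) -> float:
    """Exact P(| -(1/n) log2 p(Y^n) - H(Y) | > eps) for Y^n i.i.d. Bernoulli(p).

    The sample entropy of a sequence depends only on its number of ones k,
    so the probability is a sum of binomial terms over the atypical k.
    """
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    total = 0.0
    for k in range(n + 1):
        sample_entropy = -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n
        if abs(sample_entropy - H) > eps:
            total += math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


# Atypical-set probability for increasing block lengths (illustrative values).
lengths = [25, 100, 400]
deltas = [atypical_prob(n, p=0.3, eps=0.1) for n in lengths]
for n, d in zip(lengths, deltas):
    print(n, d)
```

For these block lengths the computed probability decreases with $n$, consistent with the atypical-set contribution $nH(Y)\delta(n)$ being absorbed into a vanishing per-symbol term.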
Rate and Leakage: uninformed case 